Abstract In the simplest sequential decision problem for an ergodic stochastic process
$X$, at each time $n$ a decision $u_n$ is made as a function of past observations
$X_0,\ldots,X_{n-1}$, and a loss $l(u_n,X_n)$ is incurred. In this setting, it is known that one
may choose (under a mild integrability assumption) a decision strategy whose pathwise
time-average loss is asymptotically smaller than that of any other strategy. The
corresponding problem in the case of partial information proves to be much more
delicate, however: if the process $X$ is not observable, but decisions must be based on
the observation of a different process $Y$, the existence of pathwise optimal strategies
is not guaranteed. The aim of this paper is to exhibit connections between pathwise
optimal strategies and notions from ergodic theory. The sequential decision problem
is developed in the general setting of an ergodic dynamical system
$(\Omega,\mathcal{B},\mathbf{P},T)$ with partial information $\mathcal{Y}\subseteq\mathcal{B}$.
The existence of pathwise optimal strategies is grounded in two basic properties: the
conditional ergodic theory of the dynamical system, and the complexity of the loss
function. When the loss function is not too complex, a general sufficient condition for
the existence of pathwise optimal strategies is that the dynamical system is a
conditional $K$-automorphism relative to the past observations
$\bigvee_{n\ge 0}T^n\mathcal{Y}$. If the conditional ergodicity assumption is
strengthened, the complexity assumption can be weakened. Several examples demonstrate
the interplay between complexity and ergodicity, which does not arise in the case of
full information. Our results also yield a decision-theoretic characterization of weak
mixing in ergodic theory, and establish pathwise optimality of ergodic nonlinear filters.



1 Introduction 

Let $X=(X_k)_{k\in\mathbb{Z}}$ be a stationary and ergodic stochastic process. A decision
maker must select at the beginning of each day $k$ a decision $u_k$ depending on the past
observations $X_0,\ldots,X_{k-1}$. At the end of the day, a loss $l(u_k,X_k)$ is incurred.
The decision maker would like to minimize her time-average loss

$$L_T(\mathbf{u}) = \frac{1}{T}\sum_{k=1}^{T} l(u_k,X_k).$$

How should she go about selecting a decision strategy $\mathbf{u}=(u_k)_{k\ge 1}$?

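There is a rather trivial answer to this question. Taking the expectation of the
time-average loss, we obtain for any strategy $\mathbf{u}$ using the tower property

$$\mathbf{E}[L_T(\mathbf{u})]
= \mathbf{E}\Big[\frac{1}{T}\sum_{k=1}^{T}\mathbf{E}[l(u_k,X_k)\,|\,X_0,\ldots,X_{k-1}]\Big]
\ge \mathbf{E}\Big[\frac{1}{T}\sum_{k=1}^{T}\min_{u}\mathbf{E}[l(u,X_k)\,|\,X_0,\ldots,X_{k-1}]\Big]
= \mathbf{E}[L_T(\tilde{\mathbf{u}})],$$

where $\tilde{\mathbf{u}}=(\tilde u_k)_{k\ge 1}$ is defined as
$\tilde u_k=\arg\min_u\mathbf{E}[l(u,X_k)\,|\,X_0,\ldots,X_{k-1}]$ (we disregard
for the moment integrability and measurability issues, existence of minima, and the
like; such issues will be properly addressed in our results). Therefore, the strategy
$\tilde{\mathbf{u}}$ minimizes the mean time-average loss $\mathbf{E}[L_T(\mathbf{u})]$.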

However, there are conceptual reasons to be dissatisfied with this obvious solution.
In many decision problems, one only observes a single sample path of the
process $X$. For example, if $X_k$ is the return of a financial market in day $k$ and
$L_T(\mathbf{u})$ is the loss of an investment strategy $\mathbf{u}$, only one sample path
of the model is ever realized: we do not have the luxury of averaging our investment loss
over multiple "alternative histories". The choice of a strategy for which the mean loss is
small does not guarantee, a priori, that it will perform well on the one and only
realization that happens to be chosen by nature. Similarly, if $X_k$ models the state of
the atmosphere and $L_T(\mathbf{u})$ is the error of a weather prediction strategy, we face
a similar conundrum. In such situations, the use of stochastic models could be justified
by some sort of ergodic theorem, which states that the mean behavior of the model
with respect to different realizations captures its time-average behavior over a single
sample path. Such an ergodic theorem for sequential decisions was obtained by
Algoet [1, Theorem 2] under a mild integrability assumption.

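Theorem 1.1 (Algoet [1]). Suppose that $|l(u,x)|\le\Lambda(x)$ with $\Lambda\in L\log L$. Then

$$\liminf_{T\to\infty}\{L_T(\mathbf{u})-L_T(\tilde{\mathbf{u}})\}\ge 0\quad\text{a.s.}$$

for every strategy $\mathbf{u}$: that is, the mean-optimal strategy $\tilde{\mathbf{u}}$ is pathwise optimal.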

The proof of this result follows from a simple martingale argument. What is
remarkable is that the details of the model do not enter the picture at all: nothing
is assumed on the properties of $X$ or $l$ beyond some integrability (ergodicity is not
needed, and a similar result holds even in the absence of stationarity, cf. [1, Theorem
3]). This provides a universal justification for optimizing the mean loss: the much
stronger pathwise optimality property is obtained "for free."

In the proof of Theorem 1.1 it is essential that the decision maker has full information
on the history $X_0,\ldots,X_{k-1}$ of the process $X$. However, the derivation of the
mean-optimal strategy can be done in precisely the same manner in the more general
setting where only partial or noisy information is available. To formalize this
idea, let $Y=(Y_k)_{k\in\mathbb{Z}}$ be the stochastic process observable by the decision
maker, and suppose that the pair $(X,Y)$ is stationary and ergodic. The loss incurred at
time $k$ is still $l(u_k,X_k)$, but now $u_k$ may depend on the observed data
$Y_0,\ldots,Y_{k-1}$ only. It is easily seen that in this setting, the mean-optimal
strategy $\tilde{\mathbf{u}}$ is given by
$\tilde u_k=\arg\min_u\mathbf{E}[l(u,X_k)\,|\,Y_0,\ldots,Y_{k-1}]$, and it is tempting to
assume that $\tilde{\mathbf{u}}$ is also pathwise optimal. Surprisingly, this is very far
from being the case.

Example 1.2 (Weissman and Merhav [32]). Let $X_0\sim\mathrm{Bernoulli}(1/2)$ and let
$X_k=1-X_{k-1}$ and $Y_k=0$ for all $k$. Then $(X,Y)$ is stationary and ergodic; the choice
$Y_k=0$ indicates that we are in the setting of no information (that is, we must make
blind decisions). Consider the loss $l(u,x)=(u-x)^2$. Then the mean-optimal strategy
$\tilde u_k=1/2$ satisfies $L_T(\tilde{\mathbf{u}})=1/4$ for all $T$. However, the strategy
$u_k=k\bmod 2$ satisfies $L_T(\mathbf{u})=0$ for all $T$ with probability $1/2$ (and
$L_T(\mathbf{u})=1$ otherwise). Therefore, $\tilde{\mathbf{u}}$ is not pathwise optimal. In
fact, it is easily seen that no pathwise optimal strategy exists.
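The dichotomy in Example 1.2 can be checked numerically. The following is a minimal sketch, not part of the original argument: the function name `simulate_example_1_2` and the use of a seeded pseudo-random generator are illustrative assumptions. It tracks one sample path and compares the two blind strategies.

```python
import random

def simulate_example_1_2(T, seed):
    """Simulate one path of Example 1.2; return (X_0, loss of u_k=1/2, loss of u_k=k mod 2)."""
    rng = random.Random(seed)       # hypothetical seeding, for reproducibility only
    x0 = rng.randint(0, 1)          # X_0 ~ Bernoulli(1/2)
    x = x0
    loss_half = 0.0                 # cumulative loss of the blind decision u_k = 1/2
    loss_parity = 0.0               # cumulative loss of the blind decision u_k = k mod 2
    for k in range(1, T + 1):
        x = 1 - x                   # X_k = 1 - X_{k-1}
        loss_half += (0.5 - x) ** 2
        loss_parity += ((k % 2) - x) ** 2
    return x0, loss_half / T, loss_parity / T

if __name__ == "__main__":
    for seed in range(5):
        print(simulate_example_1_2(1000, seed))
```

On every path the constant strategy pays exactly 1/4, while the parity strategy pays 0 when $X_0=0$ and 1 when $X_0=1$, so neither strategy dominates the other pathwise.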

Example 1.2 illustrates precisely the type of conundrum that was so fortuitously
ruled out in the full information setting by Theorem 1.1. Indeed, it would be hard to
argue that either $\mathbf{u}$ or $\tilde{\mathbf{u}}$ in Example 1.2 is superior: a gambler
placing blind bets on a sequence of games with loss $l(u_k,X_k)$ may prefer either strategy
depending on his demeanor. The example may seem somewhat artificial, however, as the hidden
process $X$ has infinitely long memory; the gambler can therefore beat the mean-optimal
strategy by simply guessing the outcome of the first game. But precisely the
same phenomenon can appear when $(X,Y)$ is nearly memoryless.

Example 1.3. Let $(\xi_k)_{k\in\mathbb{Z}}$ be i.i.d. Bernoulli(1/2), and let
$X_k=(\xi_{k-1},\xi_k)$ and $Y_k=|\xi_k-\xi_{k-1}|$ for all $k$. Then $(X,Y)$ is a stationary
1-dependent sequence: $(X_k,Y_k)_{k\le n}$ and $(X_k,Y_k)_{k\ge n+2}$ are independent for
every $n$. We consider the loss $l(u,x)=(u-x_1)^2$, where $x_1$ denotes the first coordinate
of $x$. It is easily seen that $X_k$ is independent of $Y_1,\ldots,Y_{k-1}$, so that the
mean-optimal strategy $\tilde u_k=1/2$ satisfies $L_T(\tilde{\mathbf{u}})=1/4$ for all $T$.
On the other hand, note that $\xi_{k-1}=(\xi_0+Y_1+\cdots+Y_{k-1})\bmod 2$. It follows that
the strategy $u_k=(Y_1+\cdots+Y_{k-1})\bmod 2$ satisfies $L_T(\mathbf{u})=0$ for all $T$
with probability $1/2$.
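The parity identity behind Example 1.3 can be illustrated in the same way. This is a sketch under the assumption that the path is started at time 0 with a seeded generator; the name `simulate_example_1_3` is hypothetical. The strategy only ever reads the observations $Y_1,\ldots,Y_{k-1}$ before acting at time $k$.

```python
import random

def simulate_example_1_3(T, seed):
    """Simulate one path of Example 1.3; return (xi_0, loss of u_k=1/2, loss of parity strategy)."""
    rng = random.Random(seed)                        # hypothetical seeding
    xi = [rng.randint(0, 1) for _ in range(T + 1)]   # xi_0, ..., xi_T i.i.d. Bernoulli(1/2)
    loss_half = 0.0
    loss_parity = 0.0
    running = 0                                      # (Y_1 + ... + Y_{k-1}) mod 2, empty sum at k = 1
    for k in range(1, T + 1):
        target = xi[k - 1]                           # first coordinate of X_k = (xi_{k-1}, xi_k)
        loss_half += (0.5 - target) ** 2
        loss_parity += (running - target) ** 2
        y_k = abs(xi[k] - xi[k - 1])                 # Y_k is revealed only after the decision at time k
        running = (running + y_k) % 2
    return xi[0], loss_half / T, loss_parity / T

if __name__ == "__main__":
    for seed in range(5):
        print(simulate_example_1_3(1000, seed))
```

As in Example 1.2, the parity strategy is either always right or always wrong, depending on the unobserved bit $\xi_0$.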

Evidently, pathwise optimality cannot be taken for granted in the partial information
setting even in the simplest of examples: in contrast to the full information
setting, the existence of pathwise optimal strategies depends both on specific ergodicity
properties of the model $(X,Y)$ and (as will be seen later) on the complexity
of the loss $l$. What mechanism is responsible for pathwise optimality under partial
information is not very well understood. Weissman and Merhav [32], who initiated
the study of this problem, give a strong sufficient condition in the binary setting.
Little is known beyond their result, beside one particularly special case of quadratic
loss and additive noise considered by Nobel [24].¹

1 It should be noted that the papers [1, 32, 24], in addition to studying the pathwise optimality
problem, also aim to obtain universal decision schemes that achieve the optimal asymptotic loss
without any knowledge of the law of $X$ (note that to compute the mean-optimal strategy
$\tilde{\mathbf{u}}$ one must know the joint law of $(X,Y)$). Such strategies "learn" the law of $X$
on the fly from the observed data. In the setting of partial information, such universal schemes
cannot exist without very specific assumptions on the information structure: for example, in the
blind setting (cf. Example 1.2), there is no information and thus universal strategies cannot exist.
What conditions are required for the existence of universal strategies is an interesting question
that is beyond the scope of this paper.

The aim of this paper is twofold. On the one hand, we will give general conditions
for pathwise optimality under partial information, and explore some tradeoffs inherent
in this setting. On the other hand, we aim to exhibit some connections between
the pathwise optimality problem and certain notions and problems in ergodic theory,
such as conditional mixing and individual ergodic theorems for subsequences.
To make such connections in their most natural setting, we begin by rephrasing the
decision problem in the general setting of ergodic dynamical systems.



1.1 The dynamical system setting 

Let $T$ be an invertible measure-preserving transformation of a probability space
$(\Omega,\mathcal{B},\mathbf{P})$. $T$ defines the time evolution of the dynamical system
$(\Omega,\mathcal{B},\mathbf{P},T)$: if the system is initially in state $\omega\in\Omega$,
then at time $k$ the system is in the state $T^k\omega$.
The state of the system is not directly observable, however. To model the available
information, we fix a $\sigma$-field $\mathcal{Y}\subseteq\mathcal{B}$ of events that can be
observed at a single time. Therefore, if we have observed the system in the time interval
$[m,n]$, the information contained in the observations is given by the $\sigma$-field
$\mathcal{Y}_{m,n}=\bigvee_{k\in[m,n]}T^{-k}\mathcal{Y}$.

In this general setting, the decision problem is defined as follows. Let
$\ell:U\times\Omega\to\mathbb{R}$ be a given loss function, where $U$ is the set of possible
decisions. At each time $k$, a decision $u_k$ is made and a loss
$\ell_k(u_k):=\ell(u_k,T^k\omega)$ is incurred. The decision can only depend on the
observations: that is, a strategy $\mathbf{u}=(u_k)_{k\ge 1}$ is admissible if $u_k$ is
$\mathcal{Y}_{0,k}$-measurable for every $k$. The time-average loss is given by

$$L_T(\mathbf{u}) := \frac{1}{T}\sum_{k=1}^{T}\ell_k(u_k).$$

The basic question we aim to answer is whether there exists a pathwise optimal
strategy, that is, a strategy $\mathbf{u}^*$ such that for every admissible strategy $\mathbf{u}$

$$\liminf_{T\to\infty}\{L_T(\mathbf{u})-L_T(\mathbf{u}^*)\}\ge 0\quad\text{a.s.}$$

The stochastic process setting discussed above can be recovered as a special case. 

Example 1.4. Let $(X,Y)$ be a stationary and ergodic stochastic process, where $X_k$
takes values in the measurable space $(E,\mathcal{E})$ and $Y_k$ takes values in the
measurable space $(F,\mathcal{F})$. We can realize $(X,Y)$ as the coordinate process on the
canonical path space $(\Omega,\mathcal{B},\mathbf{P})$, where
$\Omega=E^{\mathbb{Z}}\times F^{\mathbb{Z}}$,
$\mathcal{B}=\mathcal{E}^{\mathbb{Z}}\otimes\mathcal{F}^{\mathbb{Z}}$, and $\mathbf{P}$ is the
law of $(X,Y)$. Let $T:\Omega\to\Omega$ be the canonical shift
$(T(x,y))_n=(x_{n+1},y_{n+1})$. Then $(\Omega,\mathcal{B},\mathbf{P},T)$ is an
ergodic dynamical system. If we choose the observation $\sigma$-field
$\mathcal{Y}=\sigma\{Y_0\}$ and the loss $\ell(u,\omega)=l(u,X_1(\omega))$, we recover the
decision problem with partial information for the stochastic process $(X,Y)$ as it was
introduced above. More generally, we could let the loss depend arbitrarily on future or past
values of $(X,Y)$.

Let us briefly discuss the connection between pathwise optimal strategies and
classical ergodic theorems. The key observation in the derivation of the mean-optimal
strategy $\tilde u_k=\arg\min_u\mathbf{E}[\ell_k(u)\,|\,\mathcal{Y}_{0,k}]$ is that by the
tower property

$$\mathbf{E}\Big[\frac{1}{T}\sum_{k=1}^{T}\ell_k(u_k)\Big]
= \mathbf{E}\Big[\frac{1}{T}\sum_{k=1}^{T}\mathbf{E}[\ell_k(u_k)\,|\,\mathcal{Y}_{0,k}]\Big].$$

As the summands on the right-hand side depend only on the observed information,
we can minimize inside the sum to obtain the mean-optimal strategy $\tilde{\mathbf{u}}$.
Precisely the same considerations would show that $\tilde{\mathbf{u}}$ is pathwise optimal if
we could prove the ergodic counterpart of the tower property of conditional expectations

$$\frac{1}{T}\sum_{k=1}^{T}\big\{\ell_k(u_k)-\mathbf{E}[\ell_k(u_k)\,|\,\mathcal{Y}_{0,k}]\big\}
\xrightarrow{T\to\infty} 0\quad\text{a.s.}$$

The validity of such a statement is far from obvious, however.

In the special case of blind decisions (that is, $\mathcal{Y}$ is the trivial
$\sigma$-field) the "ergodic tower property" reduces to the question of whether, given
$f_k(\omega):=\ell(u_k,\omega)$,

$$\frac{1}{T}\sum_{k=1}^{T}\big\{f_k\circ T^k-\mathbf{E}[f_k]\big\}
\xrightarrow{T\to\infty} 0\quad\text{a.s.}$$

If the functions $f_k$ do not depend on $k$, this is precisely the individual ergodic
theorem. However, an individual ergodic theorem need not hold for arbitrary sequences
$f_k$. Special cases of this problem have long been investigated in ergodic theory. For
example, if $f_k=a_kf$ for some fixed function $f$ and bounded sequence
$(a_k)\subset\mathbb{R}$, the problem reduces to a weighted individual ergodic theorem, see
[2] and the references therein. If $a_k\in\{0,1\}$ for all $k$, the problem reduces further
to an individual ergodic theorem along a subsequence (at least if the sequence has positive
density), cf. [6, 3] and the references therein. A general characterization of such ergodic
properties does not appear to exist, which suggests that it is probably very difficult to
obtain necessary and sufficient conditions for pathwise optimality. The situation is
better for mean (rather than individual) ergodic theorems, cf. [?] and the references
therein, and we will also obtain more complete results in a weaker setting.

The more interesting case where the information $\mathcal{Y}$ is nontrivial provides
additional complications. In this situation, the "ergodic tower property" could be viewed
as a type of conditional ergodic theorem, in between the individual ergodic theorem
and Algoet's result [1]. Our proofs are based on an elaboration of this idea.



1.2 Some representative results 



The essence of our results is that, when the loss $\ell$ is not too complex, pathwise
optimal strategies exist under suitable conditional mixing assumptions on the ergodic
dynamical system $(\Omega,\mathcal{B},\mathbf{P},T)$. To this end, we introduce conditional
variants of two standard notions in ergodic theory: weak mixing and $K$-automorphisms.

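Definition 1.5. An invertible dynamical system $(\Omega,\mathcal{B},\mathbf{P},T)$ is said to
be conditionally weak mixing relative to a $\sigma$-field $\mathcal{Z}$ if for every
$A,B\in\mathcal{B}$

$$\frac{1}{T}\sum_{k=1}^{T}\big|\mathbf{P}[A\cap T^kB\,|\,\mathcal{Z}]
-\mathbf{P}[A\,|\,\mathcal{Z}]\,\mathbf{P}[T^kB\,|\,\mathcal{Z}]\big|
\xrightarrow{T\to\infty} 0\quad\text{in } L^1.$$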

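Definition 1.6. An invertible dynamical system $(\Omega,\mathcal{B},\mathbf{P},T)$ is called a
conditional $K$-automorphism relative to a $\sigma$-field $\mathcal{Z}\subset\mathcal{B}$ if
there is a $\sigma$-field $\mathcal{X}\subset\mathcal{B}$ such that

1. $\mathcal{X}\subset T^{-1}\mathcal{X}$.
2. $\bigvee_{k=1}^{\infty}T^{-k}\mathcal{X}=\mathcal{B}$ mod $\mathbf{P}$.
3. $\bigcap_{k=1}^{\infty}(\mathcal{Z}\vee T^{k}\mathcal{X})=\mathcal{Z}$ mod $\mathbf{P}$.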

When the $\sigma$-field $\mathcal{Z}$ is trivial, these definitions reduce to the usual
notions of weak mixing and $K$-automorphism, cf. [31].² Similar conditional mixing conditions
also appear in the ergodic theory literature, see [26] and the references therein.

An easily stated consequence of our main results, for example, is the following. 

Theorem 1.7. Suppose that $(\Omega,\mathcal{B},\mathbf{P},T)$ is a conditional
$K$-automorphism relative to $\mathcal{Y}_{-\infty,0}$. Then the mean-optimal strategy
$\tilde{\mathbf{u}}$ is pathwise optimal for every loss function
$\ell:U\times\Omega\to\mathbb{R}$ such that $U$ is finite and
$|\ell(u,\omega)|\le\Lambda(\omega)$ with $\Lambda\in L^1$.

This result gives a general sufficient condition for pathwise optimality when the
decision space $U$ is finite. In the stochastic process setting (Example 1.4), the
conditional $K$-property would follow from the validity of the $\sigma$-field identity

$$\bigcap_{k=1}^{\infty}\big(\mathcal{Y}_{-\infty,0}\vee\mathcal{X}_{-\infty,-k}\big)
=\mathcal{Y}_{-\infty,0}\quad\text{mod }\mathbf{P},$$

where $\mathcal{X}_{-\infty,k}=\sigma\{X_i:i\le k\}$ (choose
$\mathcal{X}:=\mathcal{X}_{-\infty,0}\vee\mathcal{Y}_{-\infty,0}$ in Definition 1.6). In
the Markovian setting, this is a familiar identity in filtering theory: it is precisely the
necessary and sufficient condition for the optimal filter to be ergodic, see section 3.3
below. Our results therefore lead to a new pathwise optimality property of nonlinear
filters. Conversely, results from filtering theory yield a broad class of (even
non-Markovian) models for which the conditional $K$-property can be verified [14, 27].
It is interesting to note that despite the apparent similarity between the conditions
for filter ergodicity and pathwise optimality, there appears to be no direct connection
between these phenomena, and their proofs are entirely distinct. Let us also
note that, in the full information setting ($Y_k=X_k$) the conditional $K$-property holds
trivially, which explains the deceptive simplicity of Algoet's result.



2 To be precise, our definitions are time-reversed with respect to the textbook definitions; however,
$T$ is a $K$-automorphism if and only if $T^{-1}$ is a $K$-automorphism [31, p. 110], and the
corresponding statement for weak mixing is trivial. Therefore, our definitions are equivalent to
those in [31].

While the conditional ergodicity assumption of Theorem 1.7 is quite general,
the requirement that the decision space $U$ is finite is a severe restriction on the
complexity of the loss function $\ell$. We have stated Theorem 1.7 here in order to
highlight the basic ingredients for the existence of a pathwise optimal strategy. The
assumption that $U$ is finite will be replaced by various complexity assumptions on
the loss $\ell$; such extensions will be developed in the sequel. While some complexity
assumption on the loss is needed in the partial information setting, there is a tradeoff
between the complexity and ergodicity: if the notion of conditional ergodicity is
strengthened, then the complexity assumption on the loss can be weakened.

All our pathwise optimality results are corollaries of a general master theorem,
Theorem 2.6 below, that ensures the existence of a pathwise optimal strategy under
a certain uniform version of the $K$-automorphism property. However, in the absence
of further assumptions, this theorem does not ensure that the mean-optimal strategy
$\tilde{\mathbf{u}}$ is in fact pathwise optimal: the pathwise optimal strategy constructed
in the proof may be difficult to compute. We do not know, in general, whether it is possible
that a pathwise optimal strategy exists, while the mean-optimal strategy fails to
be pathwise optimal. In order to gain further insight into such questions, we introduce
another notion of optimality that is intermediate between pathwise and mean
optimality. A strategy $\mathbf{u}^*$ is said to be weakly pathwise optimal if

$$\mathbf{P}\big[L_T(\mathbf{u})-L_T(\mathbf{u}^*)\ge-\varepsilon\big]
\xrightarrow{T\to\infty} 1\quad\text{for every }\varepsilon>0.$$

It is not difficult to show that if a weakly pathwise optimal strategy exists, then
the mean-optimal strategy $\tilde{\mathbf{u}}$ must also be weakly pathwise optimal. However,
the notion of weak pathwise optimality is distinctly weaker than pathwise optimality.
For example, we will prove the following counterpart to Theorem 1.7.

Theorem 1.8. Suppose that $(\Omega,\mathcal{B},\mathbf{P},T)$ is conditionally weak mixing
relative to $\mathcal{Y}_{-\infty,0}$. Then the mean-optimal strategy $\tilde{\mathbf{u}}$ is
weakly pathwise optimal for every loss function $\ell:U\times\Omega\to\mathbb{R}$ such that
$U$ is finite and $|\ell(u,\omega)|\le\Lambda(\omega)$ with $\Lambda\in L^1$.

There is a genuine gap between Theorems 1.8 and 1.7: in fact, a result of Conze
[?] on individual ergodic theorems for subsequences shows that there is a loss function
$\ell$ such that for a generic (in the weak topology) weak mixing system, a
mean-optimal blind strategy $\tilde{\mathbf{u}}$ fails to be pathwise optimal.

While weak pathwise optimality may not be as conceptually appealing as pathwise
optimality, the weak pathwise optimality property is easier to characterize. In
particular, we will show that the conditional weak mixing assumption in Theorem
1.8 is not only sufficient, but also necessary, in the special case that $\mathcal{Y}$ is an
invariant $\sigma$-field (that is, $\mathcal{Y}=T^{-1}\mathcal{Y}$). Invariance of
$\mathcal{Y}$ is somewhat unnatural in decision problems, as it implies that no additional
information is gained over time as more observations are accumulated. On the other hand,
invariance of $\mathcal{Z}$ in Definitions 1.5 and 1.6 is precisely the situation of interest
in applications of conditional mixing in ergodic theory (e.g., [26]). The interest of this
result is therefore that it provides a decision-theoretic characterization of the
(conditional) weak mixing property.

1.3 Organization of this paper

The remainder of the paper is organized as follows. In section 2 we state and discuss
the main results of this paper. We also give a number of examples that illustrate
various aspects of our results. Our main results require two types of assumptions:
conditional mixing assumptions on the dynamical system, and complexity assumptions
on the loss. In section 3 we discuss various methods to verify these assumptions,
as well as further examples and consequences (such as pathwise optimality of
nonlinear filters). Finally, the proofs of our main results are given in section 4.



2 Main results 

2.1 Basic setup and notation 

Throughout this paper, we will consider the following setting: 

• $(\Omega,\mathcal{B},\mathbf{P})$ is a probability space.

• $\mathcal{Y}\subseteq\mathcal{B}$ is a sub-$\sigma$-field.

• $T:\Omega\to\Omega$ is an invertible measure-preserving ergodic transformation.

• $(U,\mathcal{U})$ is a measurable space.

As explained in the introduction, we aim to make sequential decisions in the ergodic
dynamical system $(\Omega,\mathcal{B},\mathbf{P},T)$. The decisions take values in the
decision space $U$, and the $\sigma$-field $\mathcal{Y}$ represents the observable part of
the system. We define

$$\mathcal{Y}_{m,n}=\bigvee_{k=m}^{n}T^{-k}\mathcal{Y}\qquad
\text{for } -\infty\le m\le n\le\infty,$$

that is, $\mathcal{Y}_{m,n}$ is the $\sigma$-field generated by the observations in the time
interval $[m,n]$. An admissible decision strategy must depend causally on the observed data.

Definition 2.1. A strategy $\mathbf{u}=(u_k)_{k\ge 1}$ is called admissible if it is
$(\mathcal{Y}_{0,k})_{k\ge 1}$-adapted, that is, $u_k:\Omega\to U$ is
$\mathcal{Y}_{0,k}$-measurable for every $k\ge 1$.

It will be convenient to introduce the following notation. For every $m\le n$, define

$$\mathbb{U}_{m,n}=\{u:\Omega\to U : u\text{ is }\mathcal{Y}_{m,n}\text{-measurable}\},
\qquad \mathbb{U}_{n}=\bigcup_{-\infty<m\le n}\mathbb{U}_{m,n}.$$

Thus a strategy $\mathbf{u}$ is admissible whenever $u_k\in\mathbb{U}_{0,k}$ for all $k$.
Note that $\mathbb{U}_n\subset\mathbb{U}_{-\infty,n}$: this distinction will be essential for
the validity of our results.

To describe the loss of a decision strategy, we introduce a loss function $\ell$.

• $\ell:U\times\Omega\to\mathbb{R}$ is a measurable function and
$|\ell(u,\omega)|\le\Lambda(\omega)$ with $\Lambda\in L^1$.



Ergodicity, Decisions, and Partial Information 9 

If $|\ell(u,\omega)|\le\Lambda(\omega)$ with $\Lambda\in L^p$, the loss is said to be
dominated in $L^p$. As indicated above, we will always assume³ that our loss functions are
dominated in $L^1$.

The loss function $\ell(u,\omega)$ represents the cost incurred by the decision $u$ when the
system is in state $\omega$. In particular, the cost of the decision $u_k$ at time $k$ is
given by $\ell(u_k,T^k\omega)=\ell_k(u_k)$, where we define for notational simplicity

$$\ell_n(u):\Omega\to\mathbb{R},\qquad \ell_n(u)(\omega)=\ell(u,T^n\omega).$$

Our aim is to select an admissible strategy $\mathbf{u}$ that minimizes the time-average loss

$$L_T(\mathbf{u})=\frac{1}{T}\sum_{k=1}^{T}\ell_k(u_k)$$

in a suitable sense.

Definition 2.2. An admissible strategy $\mathbf{u}^*$ is pathwise optimal if

$$\liminf_{T\to\infty}\{L_T(\mathbf{u})-L_T(\mathbf{u}^*)\}\ge 0\quad\text{a.s.}$$

for every admissible strategy $\mathbf{u}$.

Definition 2.3. An admissible strategy $\mathbf{u}^*$ is weakly pathwise optimal if

$$\mathbf{P}\big[L_T(\mathbf{u})-L_T(\mathbf{u}^*)\ge-\varepsilon\big]
\xrightarrow{T\to\infty} 1\quad\text{for every }\varepsilon>0$$

for every admissible strategy $\mathbf{u}$.

Definition 2.4. An admissible strategy $\mathbf{u}^*$ is mean optimal if

$$\liminf_{T\to\infty}\{\mathbf{E}[L_T(\mathbf{u})]-\mathbf{E}[L_T(\mathbf{u}^*)]\}\ge 0$$

for every admissible strategy $\mathbf{u}$.

These notions of optimality are progressively weaker: a pathwise optimal strategy
is clearly weakly pathwise optimal, and a weakly pathwise optimal strategy is
mean optimal (as the loss function is assumed to be dominated in $L^1$).

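In the introduction, it was stated that
$\tilde u_k=\arg\min_{u\in U}\mathbf{E}[\ell_k(u)\,|\,\mathcal{Y}_{0,k}]$ defines a
mean-optimal strategy. This disregards some technical issues, as the argmin may
not exist or be measurable. It suffices, however, to consider a slight reformulation.

Lemma 2.5. There exists an admissible strategy $\tilde{\mathbf{u}}$ such that

$$\mathbf{E}[\ell_k(\tilde u_k)\,|\,\mathcal{Y}_{0,k}]\le
\operatorname*{ess\,inf}_{u\in\mathbb{U}_{0,k}}\mathbf{E}[\ell_k(u)\,|\,\mathcal{Y}_{0,k}]
+k^{-1}\quad\text{a.s.}$$

for every $k\ge 1$. In particular, $\tilde{\mathbf{u}}$ is mean-optimal.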

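3 Non-dominated loss functions may also be of significant interest, see [24] for example. We will
restrict attention to dominated loss functions, however, which suffice in many cases of interest.

Proof. It follows from the construction of the essential supremum [25, p. 49] that
there exists a countable family $(U^n)_{n\in\mathbb{N}}\subseteq\mathbb{U}_{0,k}$ such that

$$\operatorname*{ess\,inf}_{u\in\mathbb{U}_{0,k}}\mathbf{E}[\ell_k(u)\,|\,\mathcal{Y}_{0,k}]
=\inf_{n\in\mathbb{N}}\mathbf{E}[\ell_k(U^n)\,|\,\mathcal{Y}_{0,k}].$$

Define the random variable

$$\tau=\inf\Big\{n:\mathbf{E}[\ell_k(U^n)\,|\,\mathcal{Y}_{0,k}]\le
\operatorname*{ess\,inf}_{u\in\mathbb{U}_{0,k}}\mathbf{E}[\ell_k(u)\,|\,\mathcal{Y}_{0,k}]
+k^{-1}\Big\}.$$

Note that $\tau<\infty$ a.s. as
$\operatorname*{ess\,inf}_{u\in\mathbb{U}_{0,k}}\mathbf{E}[\ell_k(u)\,|\,\mathcal{Y}_{0,k}]
\ge-\mathbf{E}[\Lambda\circ T^k\,|\,\mathcal{Y}_{0,k}]>-\infty$ a.s. We
therefore define $\tilde u_k=U^{\tau}$. To show that $\tilde{\mathbf{u}}$ is mean optimal, it
suffices to note that

$$\mathbf{E}[L_T(\mathbf{u})]-\mathbf{E}[L_T(\tilde{\mathbf{u}})]
=\frac{1}{T}\sum_{k=1}^{T}\mathbf{E}\Big[\mathbf{E}[\ell_k(u_k)\,|\,\mathcal{Y}_{0,k}]
-\mathbf{E}[\ell_k(\tilde u_k)\,|\,\mathcal{Y}_{0,k}]\Big]
\ge-\frac{1}{T}\sum_{k=1}^{T}k^{-1}\xrightarrow{T\to\infty}0$$

for any admissible strategy $\mathbf{u}$ and $T\ge 1$. □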



In particular, we emphasize that a mean-optimal strategy $\tilde{\mathbf{u}}$ always exists.
In the remainder of this paper, we will fix a mean-optimal strategy $\tilde{\mathbf{u}}$ as in
Lemma 2.5.



2.2 Pathwise optimality 



Our results on the existence of pathwise optimal strategies are all consequences of
one general result, Theorem 2.6, that will be stated presently. The essential assumption
of this general result is that the properties of the conditional $K$-automorphism
(Definition 1.6) hold uniformly with respect to the loss function $\ell$. Note that, in
principle, the assumptions of this result do not imply that
$(\Omega,\mathcal{B},\mathbf{P},T)$ is a conditional $K$-automorphism, though this will
frequently be the case.

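Theorem 2.6 (Pathwise optimality). Suppose that for some $\sigma$-field
$\mathcal{X}\subset\mathcal{B}$

1. $\mathcal{X}\subset T^{-1}\mathcal{X}$.

2. The following martingales converge uniformly:

$$\operatorname*{ess\,sup}_{u\in\mathbb{U}_0}
\big|\mathbf{E}[\ell_0(u)\,|\,\mathcal{Y}_{-\infty,0}\vee T^{-n}\mathcal{X}]-\ell_0(u)\big|
\xrightarrow{n\to\infty}0\quad\text{in }L^1,$$

$$\operatorname*{ess\,sup}_{u\in\mathbb{U}_0}
\Big|\mathbf{E}[\ell_0(u)\,|\,\mathcal{Y}_{-\infty,0}\vee T^{n}\mathcal{X}]
-\mathbf{E}\big[\ell_0(u)\,\big|\,\textstyle\bigcap_{k=1}^{\infty}
(\mathcal{Y}_{-\infty,0}\vee T^{k}\mathcal{X})\big]\Big|
\xrightarrow{n\to\infty}0\quad\text{in }L^1.$$

3. The remote past does not affect the asymptotic loss:

$$L^*:=\mathbf{E}\Big[\operatorname*{ess\,inf}_{u\in\mathbb{U}_0}
\mathbf{E}[\ell_0(u)\,|\,\mathcal{Y}_{-\infty,0}]\Big]
=\mathbf{E}\Big[\operatorname*{ess\,inf}_{u\in\mathbb{U}_0}
\mathbf{E}\big[\ell_0(u)\,\big|\,\textstyle\bigcap_{k=1}^{\infty}
(\mathcal{Y}_{-\infty,0}\vee T^{k}\mathcal{X})\big]\Big].$$

Then there exists an admissible strategy $\mathbf{u}^*$ such that for every admissible
strategy $\mathbf{u}$

$$\liminf_{T\to\infty}\{L_T(\mathbf{u})-L_T(\mathbf{u}^*)\}\ge 0\quad\text{a.s.},\qquad
\lim_{T\to\infty}L_T(\mathbf{u}^*)=L^*\quad\text{a.s.},$$

that is, $\mathbf{u}^*$ is pathwise optimal and $L^*$ is the optimal long time-average loss.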

The proof of this result will be given in section 4.1 below.
Before going further, let us discuss the conceptual nature of the assumptions of
Theorem 2.6. The assumptions encode two separate requirements:

1. Assumption 3 of Theorem 2.6 should be viewed as a mixing assumption on
the dynamical system $(\Omega,\mathcal{B},\mathbf{P},T)$ that is tailored to the decision
problem. Indeed, $\mathcal{Y}_{-\infty,0}$ represents the information contained in the
observations, while $\bigcap_{k=1}^{\infty}(\mathcal{Y}_{-\infty,0}\vee T^{k}\mathcal{X})$
includes in addition the remote past of the generating $\sigma$-field $\mathcal{X}$. The
assumption states that knowledge of the remote past of the unobserved
part of the model cannot be used to improve our present decisions.

2. Assumption 2 of Theorem 2.6 should be viewed as a complexity assumption on
the loss function $\ell$. Indeed, in the absence of the essential suprema, these
statements hold automatically by the martingale convergence theorem. The assumption
requires that the convergence is in fact uniform in $u\in\mathbb{U}_0$. This will be the
case when the loss function is not too complex.

The assumptions of Theorem 2.6 can be verified in many cases of interest. In section 3
below, we will discuss various methods that can be used to verify both the
conditional mixing and the complexity assumptions of Theorem 2.6.

In general, neither the conditional mixing nor the complexity assumption can be
dispensed with in the presence of partial information.

Example 2.7 (Assumption 3 is essential). We have seen in Examples 1.2 and 1.3 in
the introduction that no pathwise optimal strategy exists. In both these examples
Assumption 2 is satisfied, that is, the loss function is not too complex (this will
follow from general complexity results, cf. Example 3.6 in section 3 below). On the
other hand, it is easily seen that the conditional mixing Assumption 3 is violated.

Example 2.8 (Assumption 2 is essential). Let $X=(X_k)_{k\in\mathbb{Z}}$ be the stationary
Markov chain in $[0,1]$ defined by $X_{k+1}=(X_k+\varepsilon_{k+1})/2$ for all $k$, where
$(\varepsilon_k)_{k\in\mathbb{Z}}$ is an i.i.d. sequence of Bernoulli(1/2) random variables.
We consider the setting of blind decisions with the loss function
$\ell_k(u)=\lfloor 2^uX_k\rfloor\bmod 2$, $u\in U=\mathbb{N}$. Note that

$$X_k=\sum_{i=0}^{\infty}2^{-i-1}\varepsilon_{k-i},\qquad\text{so that}\qquad
\ell_k(u)=\varepsilon_{k-u+1}.$$

We claim that no pathwise optimal strategy can exist. Indeed, consider for fixed
$r\ge 0$ the strategy $\mathbf{u}^r$ such that $u^r_k=k+r$. Then
$\ell_k(u^r_k)=\varepsilon_{1-r}$ for all $k$. Therefore,

$$\varepsilon_{1-r}-\limsup_{T\to\infty}L_T(\mathbf{u}^*)
=\liminf_{T\to\infty}\{L_T(\mathbf{u}^r)-L_T(\mathbf{u}^*)\}\ge 0
\quad\text{a.s. for all } r\ge 0$$

for every pathwise optimal strategy $\mathbf{u}^*$. In particular,

$$0=\inf_{r\ge 0}\varepsilon_{1-r}\ge\limsup_{T\to\infty}L_T(\mathbf{u}^*)
\ge\liminf_{T\to\infty}L_T(\mathbf{u}^*)\ge 0\quad\text{a.s.}$$

As $|L_T(\mathbf{u}^*)|\le 1$ for all $T$, it follows by dominated convergence that a
pathwise optimal strategy $\mathbf{u}^*$ must satisfy $\mathbf{E}[L_T(\mathbf{u}^*)]\to 0$ as
$T\to\infty$. But clearly $\mathbf{E}[L_T(\mathbf{u})]=1/2$ for every $T$ and strategy
$\mathbf{u}$, which entails a contradiction.

Nonetheless, in this example the dynamical system is a $K$-automorphism (even a
Bernoulli shift), so that Assumption 3 is easily satisfied. As no pathwise optimal
strategy exists, this must be caused by the failure of Assumption 2. For example, for
the natural choice $\mathcal{X}=\sigma\{X_k:k\le 0\}$, Assumption 3 holds as
$\bigcap_kT^k\mathcal{X}$ is trivial by the Kolmogorov zero-one law, but it is easily seen
that the second equation of Assumption 2 fails. Note that the function
$l(u,x)=\lfloor 2^ux\rfloor\bmod 2$ becomes increasingly oscillatory as $u\to\infty$; this is
precisely the type of behavior that obstructs uniform convergence in Assumption 2 (akin to
"overfitting" in statistics).
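The bit-extraction mechanism of Example 2.8 can be verified with exact rational arithmetic. The sketch below makes two simplifying assumptions that are not in the text: the chain is started from $X=0$ instead of its stationary law (a truncation of the infinite past), and time is indexed by list position. The helper names `chain_path` and `loss` are illustrative.

```python
from fractions import Fraction
from math import floor
import random

def chain_path(eps):
    """Iterate X_{k+1} = (X_k + eps_{k+1}) / 2 exactly, starting from 0 (assumed truncation)."""
    x = Fraction(0)
    path = []
    for e in eps:
        x = (x + e) / 2
        path.append(x)      # path[k] = sum_{i=0}^{k} 2^{-i-1} * eps[k-i]
    return path

def loss(u, x):
    """The loss floor(2^u x) mod 2, which reads off the u-th binary digit of x."""
    return floor(2 ** u * x) % 2

if __name__ == "__main__":
    rng = random.Random(0)
    eps = [rng.randint(0, 1) for _ in range(12)]
    path = chain_path(eps)
    j = 3
    # The growing decision u = k - j + 1 keeps reading the same fixed bit eps[j].
    print([loss(k - j + 1, path[k]) for k in range(j, len(eps))], "target bit:", eps[j])
```

A decision that grows linearly in time therefore always reads the same past innovation, which is why such strategies achieve a loss that is constant along each path.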

Example 2.9 (Assumption 2 is essential, continued). In the previous example, path- 
wise optimality fails due to failure of the second equation of Assumption 2. We now 
give a variant of this example where the first equation of Assumption 2 fails. 

Let $X=(X_k)_{k\in\mathbb{Z}}$ be an i.i.d. sequence of Bernoulli(1/2) random variables. We
consider the setting of blind decisions with the loss function $\ell_k(u)=X_{k+u}$,
$u\in U=\mathbb{N}$. We claim that no pathwise optimal strategy can exist. Indeed, consider
for $r=0,1$ the strategy $\mathbf{u}^r$ defined by $u^r_k=2^{n+r+1}-k$ for
$2^n\le k<2^{n+1}$, $n\ge 0$. Then

$$L_{2^n-1}(\mathbf{u}^r)
=\frac{1}{2^n-1}\sum_{m=0}^{n-1}\sum_{k=2^m}^{2^{m+1}-1}X_{2^{m+r+1}}
=\frac{1}{2^n-1}\sum_{m=0}^{n-1}2^m\,X_{2^{m+r+1}}.$$

Suppose that $\mathbf{u}^*$ is pathwise optimal. Then

$$\liminf_{T\to\infty}\mathbf{E}\big[L_T(\mathbf{u}^0)\wedge L_T(\mathbf{u}^1)
-L_T(\mathbf{u}^*)\big]
\ge\mathbf{E}\Big[\liminf_{T\to\infty}\big\{L_T(\mathbf{u}^0)\wedge L_T(\mathbf{u}^1)
-L_T(\mathbf{u}^*)\big\}\Big]\ge 0.$$

But a simple computation shows that
$\mathbf{E}[L_{2^n-1}(\mathbf{u}^0)\wedge L_{2^n-1}(\mathbf{u}^1)]$ converges as
$n\to\infty$ to a quantity strictly less than $1/2=\mathbf{E}[L_T(\mathbf{u}^*)]$, so that we
have a contradiction.

Nonetheless, in this example Assumption 3 and the second line of Assumption 2
are easily satisfied, e.g., for the natural choice $\mathcal{X}=\sigma\{X_k:k\le 0\}$.
However, the first line of Assumption 2 fails, and indeed no pathwise optimal strategy exists.
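The "simple computation" in Example 2.9 can be made concrete for small horizons. The following sketch assumes that the two block strategies read the bits $X_{2^{m+1}}$ (for $r=0$) and $X_{2^{m+2}}$ (for $r=1$) with weight $2^m$ on the $m$-th dyadic block, which is what the definition $u^r_k=2^{n+r+1}-k$ gives; the function name `expected_min_loss` is hypothetical. It enumerates all bit patterns and computes the expectation of the minimum exactly.

```python
from fractions import Fraction
from itertools import product

def expected_min_loss(n):
    """Exact E[min(L_T(u^0), L_T(u^1))] at horizon T = 2^n - 1 for i.i.d. Bernoulli(1/2) bits."""
    T = 2 ** n - 1
    total = Fraction(0)
    # bits[j] stands for X_{2^(j+1)}, j = 0..n; only these n + 1 variables matter.
    for bits in product((0, 1), repeat=n + 1):
        loss0 = sum(2 ** m * bits[m] for m in range(n))      # r = 0 reads X_{2^(m+1)}
        loss1 = sum(2 ** m * bits[m + 1] for m in range(n))  # r = 1 reads X_{2^(m+2)}
        total += Fraction(min(loss0, loss1), T)
    return total / 2 ** (n + 1)                              # each pattern has probability 2^-(n+1)

if __name__ == "__main__":
    for n in range(1, 9):
        print(n, float(expected_min_loss(n)))
```

Each of the two strategies has mean loss 1/2, but since their losses are not almost surely equal, the expected minimum stays strictly below 1/2 at every horizon of this form.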

It is evident from the previous examples that an assumption on both conditional
mixing and on complexity of the loss function is needed, in general, to ensure existence
of a pathwise optimal strategy. In this light, the complete absence of any such
assumptions in the full information case is surprising. The explanation is simple,
however: all assumptions of Theorem 2.6 are automatically satisfied in this case.

Example 2.10 (Full information). Let $X=(X_k)_{k\in\mathbb{Z}}$ be any stationary ergodic
process, and consider the case of full information: that is, we choose the observation
$\sigma$-field $\mathcal{Y}=\sigma\{X_0\}$ and the loss $\ell(u,\omega)=l(u,X_1(\omega))$.
Then all assumptions of Theorem 2.6 are satisfied: indeed, if we choose
$\mathcal{X}=\sigma\{X_k:k\le 0\}$, then
$\mathcal{Y}_{-\infty,0}=\mathcal{Y}_{-\infty,0}\vee T^{k}\mathcal{X}$ for all $k\ge 0$, so
that Assumption 3 and the second line of Assumption 2 hold trivially. Moreover,
$\ell_0(u)$ is $T^{-k}\mathcal{X}$-measurable for every $u\in\mathbb{U}_0$ and $k\ge 1$, and
thus the first line of Assumption 2 holds trivially. It follows that in
the full information setting, a pathwise optimal strategy always exists.

In a sense, Theorem 2.6 provides additional insight even in the full information
setting: it provides an explanation as to why the case of full information is so much
simpler than the partial information setting. Moreover, Theorem 2.6 provides an
explicit expression for the optimal asymptotic loss $L^*$, which is not given in [1].⁴

However, it should be emphasized that Theorem 2.6 does not state that the mean-optimal
strategy $\tilde{\mathbf{u}}$ is pathwise optimal; it only guarantees the existence of some
pathwise optimal strategy $\mathbf{u}^*$. In contrast, in the full information setting, Theorem
1.1 ensures pathwise optimality of the mean-optimal strategy. This is of practical
importance, as the mean-optimal strategy can in many cases be computed explicitly
or by efficient numerical methods, while the pathwise optimal strategy constructed
in the proof of Theorem 2.6 may be difficult to compute. We do not know whether
it is possible in the general setting of Theorem 2.6 that a pathwise optimal strategy
exists, but that the mean-optimal strategy $\tilde{\mathbf{u}}$ is not pathwise optimal. Pathwise
optimality of the mean-optimal strategy $\tilde{\mathbf{u}}$ can be shown, however, under somewhat
stronger assumptions. The following corollary is proved in section 4.2 below.

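Corollary 2.11. Suppose that for some $\sigma$-field $\mathcal{X} \subset \mathcal{B}$

1. $\mathcal{X} \subset T^{-1}\mathcal{X}$.

2. The following martingales converge uniformly:

\[
\operatorname*{ess\,sup}_{u\in\mathbb{U}_0} \big| E[\ell_0(u) \mid \mathcal{Y}_{-\infty,0} \vee T^{-n}\mathcal{X}] - \ell_0(u) \big| \xrightarrow{n\to\infty} 0 \quad\text{in } L^1,
\]
\[
\operatorname*{ess\,sup}_{u\in\mathbb{U}_0} \Big| E[\ell_0(u) \mid \mathcal{Y}_{-\infty,0} \vee T^{n}\mathcal{X}] - E\Big[\ell_0(u) \,\Big|\, \bigcap_{k=1}^{\infty} (\mathcal{Y}_{-\infty,0} \vee T^{k}\mathcal{X})\Big] \Big| \xrightarrow{n\to\infty} 0 \quad\text{in } L^1,
\]
\[
\operatorname*{ess\,sup}_{u\in\mathbb{U}_{-n,0}} \big| E[\ell_0(u) \mid \mathcal{Y}_{-n,0}] - E[\ell_0(u) \mid \mathcal{Y}_{-\infty,0}] \big| \xrightarrow{n\to\infty} 0 \quad\text{a.s.}
\]

3. The remote past does not affect the present:

\[
E[\ell_0(u) \mid \mathcal{Y}_{-\infty,0}] = E\Big[\ell_0(u) \,\Big|\, \bigcap_{k=1}^{\infty} (\mathcal{Y}_{-\infty,0} \vee T^{k}\mathcal{X})\Big] \quad\text{for all } u \in \mathbb{U}_0.
\]

Then the mean-optimal strategy $\tilde{\mathbf{u}}$ (Lemma 2.5) satisfies $L_T(\tilde{\mathbf{u}}) \to L^*$ a.s. as $T \to \infty$.
In particular, it follows from Theorem 2.6 that $\tilde{\mathbf{u}}$ is pathwise optimal.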

4 In [1, Appendix II.B] it is shown that under a continuity assumption on the loss function $l$, the
optimal asymptotic loss in the full information setting is given by $E[\inf_u E[l(u,X_1) \mid X_0, X_{-1}, \ldots]]$.
However, a counterexample is given of a discontinuous loss function for which this expression
does not yield the optimal asymptotic loss. The key difference with the expression for $L^*$ given in
Theorem 2.6 is that in the latter the essential infimum runs over $u \in \mathbb{U}_0$, while it is implicit in [1] that
the infimum in the above expression is an essential infimum over $u \in \mathbb{U}_{-\infty,0}$. As the counterexample
in [1] shows, these quantities need not coincide in the absence of continuity assumptions.
The assumptions of Corollary 2.11 are stronger than those of Theorem 2.6 in two
respects. First, Assumption 3 is slightly strengthened; however, this is a very mild
requirement. More importantly, a third martingale is assumed to converge uniformly
(pathwise!) in Assumption 2. The latter is not an innocuous requirement: while the
assumption holds in many cases of interest, substantial regularity of the loss function
is needed (see section 3.1 for further discussion). In particular, this requirement is
not automatically satisfied in the case of full information, and Theorem 1.1 therefore
does not follow in its entirety from our results. It remains an open question whether
it is possible to establish pathwise optimality of the mean-optimal strategy $\tilde{\mathbf{u}}$ under
a substantial weakening of the assumptions of Corollary 2.11.

A particularly simple regularity assumption on the loss is that the decision space
$U$ is finite. In this case uniform convergence is immediate, so that the assumptions
of Corollary 2.11 reduce essentially to the $\mathcal{Y}_{-\infty,0}$-conditional $K$-property. Therefore,
evidently Corollary 2.11 implies Theorem 1.7. More general conditions that ensure
the validity of the requisite assumptions will be discussed in section 3.



2.3 Weak pathwise optimality 

In the previous section, we have seen that a pathwise optimal strategy $\mathbf{u}^*$ exists
under general assumptions. However, unlike in the full information case, it is not
clear whether in general (without a nontrivial complexity assumption) the mean-optimal
strategy $\tilde{\mathbf{u}}$ is pathwise optimal. In the present section, we will aim to obtain
some additional insight into this issue by considering the notion of weak pathwise
optimality (Definition 2.3), which is intermediate between pathwise optimality and
mean optimality. This notion is more regularly behaved than pathwise optimality;
in particular, it is straightforward to prove the following simple result.

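Lemma 2.12. Suppose that a weakly pathwise optimal strategy $\mathbf{u}^*$ exists. Then the
mean-optimal strategy $\tilde{\mathbf{u}}$ is also weakly pathwise optimal.

Proof. Let $\Lambda_T = \frac{1}{T}\sum_{k=1}^{T} \Lambda \circ T^k$. As $|L_T(\mathbf{u})| \le \Lambda_T$ for any strategy $\mathbf{u}$, we have

\[
E\big[(L_T(\tilde{\mathbf{u}}) - L_T(\mathbf{u}^*))_-\big] \le \varepsilon\, P\big[L_T(\tilde{\mathbf{u}}) - L_T(\mathbf{u}^*) \ge -\varepsilon\big] + E\big[2\Lambda_T 1_{L_T(\tilde{\mathbf{u}}) - L_T(\mathbf{u}^*) < -\varepsilon}\big]
\]

for any $\varepsilon > 0$. Note that the sequence $(\Lambda_T)_{T\ge 1}$ is uniformly integrable as $\Lambda_T \to E[\Lambda]$
in $L^1$ by the ergodic theorem. Therefore, using weak pathwise optimality of $\mathbf{u}^*$, it
follows that $E[(L_T(\tilde{\mathbf{u}}) - L_T(\mathbf{u}^*))_-] \to 0$ as $T \to \infty$. Using the identity $|x| = x + 2x_-$, we therefore have

\[
\limsup_{T\to\infty} E\big[|L_T(\tilde{\mathbf{u}}) - L_T(\mathbf{u}^*)|\big] = -\liminf_{T\to\infty} \big\{E[L_T(\mathbf{u}^*)] - E[L_T(\tilde{\mathbf{u}})]\big\} \le 0
\]

by mean-optimality of $\tilde{\mathbf{u}}$. It follows easily that $\tilde{\mathbf{u}}$ is also weakly pathwise optimal. □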

While Theorem 2.6 does not ensure that the mean-optimal strategy $\tilde{\mathbf{u}}$ is pathwise
optimal, the previous lemma guarantees that $\tilde{\mathbf{u}}$ is at least weakly pathwise optimal.
However, we will presently show that the latter conclusion may follow under considerably
weaker assumptions than those of Theorem 2.6. Indeed, just as pathwise
optimality was established for conditional $K$-automorphisms, we will establish
weak optimality for conditionally weakly mixing automorphisms.

Let us begin by developing a general result on weak pathwise optimality, Theorem
2.13 below, which plays the role of Theorem 2.6 in the present setting. The essential
assumption of this general result is that the conditional weak mixing property
(Definition 1.5) holds uniformly with respect to the loss function $\ell$. For simplicity
of notation, let us define as in Theorem 2.6 the optimal asymptotic loss

\[
L^* := E\Big[\operatorname*{ess\,inf}_{u\in\mathbb{U}_0} E[\ell_0(u) \mid \mathcal{Y}_{-\infty,0}]\Big]
\]

(let us emphasize, however, that Assumption 3 of Theorem 2.6 need not hold in the
present setting!). In addition, let us define the modified loss functions

\[
\bar\ell_0(u) := \ell_0(u) - E[\ell_0(u) \mid \mathcal{Y}_{-\infty,0}], \qquad
\bar\ell_0^M(u) := \ell_0(u) 1_{\Lambda \le M} - E[\ell_0(u) 1_{\Lambda \le M} \mid \mathcal{Y}_{-\infty,0}].
\]
The proof of the following theorem will be given in section 4.
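Theorem 2.13. Suppose that the uniform conditional mixing assumption

\[
\lim_{M\to\infty} \limsup_{T\to\infty} \bigg\| \frac{1}{T} \sum_{k=1}^{T} \operatorname*{ess\,sup}_{u,u'\in\mathbb{U}_0} \Big| E\big[\{\bar\ell_0^M(u) \circ T^{-k}\}\, \bar\ell_0^M(u') \,\big|\, \mathcal{Y}_{-\infty,0}\big] \Big| \bigg\|_1 = 0
\]

holds. Then the mean-optimal strategy $\tilde{\mathbf{u}}$ is weakly pathwise optimal, and the optimal
long time-average loss satisfies the ergodic theorem $L_T(\tilde{\mathbf{u}}) \to L^*$ in $L^1$.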

Remark 2.14. We have assumed throughout that the loss function $\ell$ is dominated in
$L^1$. If the loss is in fact dominated in $L^2$, that is, $|\ell(u,\omega)| \le \Lambda(\omega)$ with $\Lambda \in L^2$, then
the assumption of Theorem 2.13 is evidently implied by the natural assumption

\[
\frac{1}{T} \sum_{k=1}^{T} \operatorname*{ess\,sup}_{u,u'\in\mathbb{U}_0} \Big| E\big[\{\bar\ell_0(u) \circ T^{-k}\}\, \bar\ell_0(u') \,\big|\, \mathcal{Y}_{-\infty,0}\big] \Big| \xrightarrow{T\to\infty} 0 \quad\text{in } L^1,
\]

and in this case $L_T(\tilde{\mathbf{u}}) \to L^*$ in $L^2$ (by dominated convergence). The additional truncation
in Theorem 2.13 is included only to obtain a result that holds in $L^1$.

Conceptually, as in Theorem 2.6, the assumption of Theorem 2.13 combines a
conditional mixing assumption and a complexity assumption. Indeed, the conditional
weak mixing property relative to $\mathcal{Y}_{-\infty,0}$ (Definition 1.5) implies that

\[
\frac{1}{T} \sum_{k=1}^{T} \Big| E\big[\{f \circ T^{-k}\}\, g \,\big|\, \mathcal{Y}_{-\infty,0}\big] - E\big[f \circ T^{-k} \,\big|\, \mathcal{Y}_{-\infty,0}\big]\, E\big[g \,\big|\, \mathcal{Y}_{-\infty,0}\big] \Big| \xrightarrow{T\to\infty} 0 \quad\text{in } L^1
\]

for every $f,g \in L^2$ (indeed, for simple functions $f,g$ this follows directly from the
definition, and the claim for general $f,g$ follows by approximation in $L^2$). Therefore,
in the absence of the essential supremum, the assumption of Theorem 2.13 reduces
essentially to the assumption that the dynamical system $(\Omega, \mathcal{B}, P, T)$ is conditionally
weak mixing relative to $\mathcal{Y}_{-\infty,0}$. However, Theorem 2.13 requires in addition that
the convergence in the definition of the conditional weak mixing property holds
uniformly with respect to the possible decisions $u \in \mathbb{U}_0$. This will be the case when
the loss function $\ell$ is not too complex (cf. section 3). For example, in the extreme
case where the decision space $U$ is finite, uniformity is automatic, and thus Theorem
1.8 in the introduction follows immediately from Theorem 2.13.

Recall that a pathwise optimal strategy is necessarily weakly pathwise optimal.
This is reflected, for example, in Theorems 1.7 and 1.8: indeed, note that

\[
\begin{aligned}
\big\| P[A \cap T^k B \mid \mathcal{Z}] - P[A \mid \mathcal{Z}]\, P[T^k B \mid \mathcal{Z}] \big\|_1
&= \big\| E[\{1_A - P[A \mid \mathcal{Z}]\}\, 1_{T^k B} \mid \mathcal{Z}] \big\|_1 \\
&\le \big\| E[\{1_A - P[A \mid \mathcal{Z}]\}\, P[T^k B \mid T^{k-n}\mathcal{X}] \mid \mathcal{Z}] \big\|_1 + \big\| 1_{T^k B} - P[T^k B \mid T^{k-n}\mathcal{X}] \big\|_1 \\
&\le \big\| P[A \mid \mathcal{Z} \vee T^{k-n}\mathcal{X}] - P[A \mid \mathcal{Z}] \big\|_1 + \big\| 1_B - P[B \mid T^{-n}\mathcal{X}] \big\|_1
\end{aligned}
\]

for any $n,k$, so that the conditional $K$-property implies the conditional weak mixing
property (relative to any $\sigma$-field $\mathcal{Z}$) by letting $k \to \infty$, then $n \to \infty$. Along the
same lines, one can show that a slight variation of the assumptions of Theorem 2.6
implies the assumption of Theorem 2.13 (modulo minor issues of truncation, which
could have been absorbed in Theorem 2.6 also at the expense of heavier notation).
It is not entirely obvious, at first sight, how far apart the conclusions of our main
results really are. For example, in the setting of full information, cf. Example 2.10,
the assumption of Theorem 2.13 holds automatically (as then $\bar\ell_0^M(u) \circ T^{-k}$ is $\mathcal{Y}_{-\infty,0}$-measurable
for every $u \in \mathbb{U}_0$ and $k \ge 1$). Moreover, the reader can easily verify that
in all the examples we have given where no pathwise optimal strategy exists (Examples
1.2, 1.3, 2.8, and 2.9), even the existence of a weakly pathwise optimal strategy
fails. It is therefore tempting to assume that in a typical situation where a weakly 
pathwise optimal strategy exists, there will likely also be a pathwise optimal strat- 
egy. The following example, which is a manifestation of a rather surprising result in 
ergodic theory due to Conze [6], provides some evidence to the contrary. 

Example 2.15 (Generic transformations). In this example, we fix the probability
space $(\Omega, \mathcal{B}, P)$, where $\Omega = [0,1]$ with its Borel $\sigma$-field $\mathcal{B}$ and the Lebesgue measure
$P$. We will consider the decision space $U = \{0,1\}$ and loss function $\ell$ defined as

\[
\ell(u,\omega) = -u\,\big(1_{[0,1/2]}(\omega) - 1/2\big) \quad\text{for } (u,\omega) \in U \times \Omega.
\]

Moreover, we will consider the setting of blind decisions, that is, $\mathcal{Y}$ is trivial.

We have not yet defined a transformation $T$. Our aim is to prove the following: for
a generic invertible measure-preserving transformation $T$, there is a mean-optimal
strategy $\tilde{\mathbf{u}}$ that is weakly pathwise optimal but not pathwise optimal. This shows not
only that there can be a substantial gap between Theorems 1.7 and 1.8, but that this
is in fact the typical situation (at least in the sense of weak topology).
Let us recall some basic notions. Denote by $\mathcal{T}$ the set of all invertible measure-preserving
transformations of $(\Omega, \mathcal{B}, P)$. The weak topology on $\mathcal{T}$ is the topology
generated by the basic neighborhoods $B(T_0, B, \varepsilon) = \{T \in \mathcal{T} : P[TB \,\triangle\, T_0 B] < \varepsilon\}$ for
all $T_0 \in \mathcal{T}$, $B \in \mathcal{B}$, $\varepsilon > 0$. A property is said to hold for a generic transformation if
it holds for every transformation $T$ in a dense $G_\delta$ subset of $\mathcal{T}$. A well-known result
of Halmos [13] states that a generic transformation is weak mixing. Therefore, for
a generic transformation, any mean-optimal strategy $\tilde{\mathbf{u}}$ is weakly pathwise optimal
by Theorem 1.8. This proves the first part of our statement.

Of course, in the present setting, $E[\ell_0(u) \mid \mathcal{Y}_{0,k}] = E[\ell_0(u)] = 0$ for every decision
$u \in U$. Therefore, every admissible strategy $\mathbf{u}$ is mean-optimal, and the optimal
mean loss is given by $L^* = 0$, regardless of the choice of transformation $T \in \mathcal{T}$. It
is natural to choose a stationary strategy $\tilde{\mathbf{u}}$ (for example, $\tilde u_k = 1$ for all $k$) so that
$\lim_{T\to\infty} L_T(\tilde{\mathbf{u}}) = L^*$ a.s. We will show that for a generic transformation, the strategy
$\tilde{\mathbf{u}}$ is not pathwise optimal. To this end, it evidently suffices to find another strategy
$\mathbf{u}$ such that $\liminf_{T\to\infty} L_T(\mathbf{u}) < L^*$ with positive probability.

To this end, we use the following result of Conze that can be read off from the
proof of [6, Theorem 5]: there exists a sequence $n_k \uparrow \infty$ with $k/n_k \to 1/2$ such that
for every $0 < \alpha < 1$ and $1/2 < \lambda < 1$, a generic transformation $T$ satisfies

\[
P\bigg[\limsup_{N\to\infty} \frac{1}{N} \sum_{k=1}^{N} 1_{[0,1/2]} \circ T^{n_k} \ge \lambda\bigg] \ge 1 - \alpha.
\]

Define the strategy $\mathbf{u}$ such that $u_n = 1$ if $n = n_k$ for some $k$, and $u_n = 0$ otherwise.
Then, for a generic transformation $T$, we have with probability at least $1 - \alpha$

\[
\liminf_{N\to\infty} L_{n_N}(\mathbf{u}) = -\limsup_{N\to\infty} \frac{1}{n_N} \sum_{k=1}^{N} \big(1_{[0,1/2]} \circ T^{n_k} - 1/2\big) \le -\frac{2\lambda - 1}{4}.
\]

In words, we have shown that for a generic transformation $T$, the time-average loss
of the mean-optimal strategy $\tilde{\mathbf{u}}$ exceeds that of the strategy $\mathbf{u}$ infinitely often by
almost 1/4 with almost unit probability. Thus the mean-optimal strategy $\tilde{\mathbf{u}}$ fails to
be pathwise optimal in a very strong sense, and our claim is established.

Example 2.15 only shows that there is a mean-optimal strategy $\tilde{\mathbf{u}}$ that is weakly
pathwise optimal but not pathwise optimal. It does not make any statement about
whether or not a pathwise optimal strategy $\mathbf{u}^*$ actually exists. However, we do not
know of any mechanism that might lead to pathwise optimality in such a setting. We
therefore conjecture that for a generic transformation a pathwise optimal strategy
in fact fails to exist at all, so that (unlike in the full information setting) pathwise
optimality and weak pathwise optimality are really distinct notions.

The result of Conze used in Example 2.15 originates from a deep problem in
ergodic theory that aims to understand the validity of individual ergodic theorems
for subsequences, cf. [6, 2] and the references therein. A general characterization of
such ergodic properties does not appear to exist, which suggests that the pathwise
optimality property may be difficult to characterize beyond general sufficient conditions
such as Theorem 2.6. In contrast, the weak pathwise optimality property is
much more regularly behaved. The following theorem, which will be proved in section
4.4 below, provides a complete characterization of weak pathwise optimality in
the special case that the observation field $\mathcal{Y}$ is invariant.

Theorem 2.16. Let $(\Omega, \mathcal{B}, P, T)$ be an ergodic dynamical system, and suppose that
$(\Omega, \mathcal{B}, P)$ is a standard probability space and that $\mathcal{Y} \subset \mathcal{B}$ is an invariant $\sigma$-field
(that is, $\mathcal{Y} = T^{-1}\mathcal{Y}$). Then the following are equivalent:

1. $(\Omega, \mathcal{B}, P, T)$ is conditionally weak mixing relative to $\mathcal{Y}$.

2. For every bounded loss function $\ell : U \times \Omega \to \mathbb{R}$ with finite decision space
$\operatorname{card} U < \infty$, there exists a weakly pathwise optimal strategy.

The invariance of $\mathcal{Y}$ is automatic in the setting of blind decisions (as $\mathcal{Y}$ is trivial),
in which case Theorem 2.16 yields a decision-theoretic characterization of the weak
mixing property. In more general observation models, invariance of $\mathcal{Y}$ may be an unnatural
requirement from the point of view of decisions under partial information,
as it implies that there is no information gain over time. On the other hand, applications
of the notion of conditional weak mixing relative to a $\sigma$-field $\mathcal{Z}$ in ergodic
theory almost always assume that $\mathcal{Z}$ is invariant (e.g., [26]). Theorem 2.16 yields a
decision-theoretic interpretation of this property by choosing $\mathcal{Y} = \mathcal{Z}$.



3 Complexity and conditional ergodicity 

3.1 Universal complexity assumptions 

The goal of this section is to develop complexity assumptions on the loss function
$\ell$ that ensure that the uniform convergence assumptions in our main results hold
regardless of any properties of the transformation $T$ or observations $\mathcal{Y}$. While such
universal complexity assumptions are not always necessary (for example, in the full
information setting uniform convergence holds regardless of the loss function), they
frequently hold in practice and provide easily verifiable conditions that ensure that
our results hold in a broad class of decision problems with partial information.
The simplest assumption is Grothendieck's notion of equimeasurability [12].

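Definition 3.1. The loss function $\ell : U \times \Omega \to \mathbb{R}$ on the probability space $(\Omega, \mathcal{B}, P)$
is said to be equimeasurable if for every $\varepsilon > 0$, there exists $\Omega_\varepsilon \in \mathcal{B}$ with $P[\Omega_\varepsilon] \ge 1 - \varepsilon$
such that the class of functions $\{\ell_0(u) 1_{\Omega_\varepsilon} : u \in U\}$ is totally bounded in $L^\infty(P)$.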

The beauty of this simple notion is that it ensures uniform convergence of almost 
anything. In particular, we obtain the following results. 

Lemma 3.2. Suppose that the loss function $\ell$ is equimeasurable. Then Assumption
2 of Corollary 2.11 holds, and thus Assumption 2 of Theorem 2.6 holds as well,
provided that $\mathcal{X}$ is a generating $\sigma$-field (that is, $\bigvee_n T^{-n}\mathcal{X} = \mathcal{B}$).
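Proof. Let us establish the first line of Assumption 2. Fix $\varepsilon > 0$ and $\Omega_\varepsilon$ as in Definition
3.1. Then there exist $N < \infty$ measurable functions $l_1, \ldots, l_N : \Omega \to \mathbb{R}$ such
that for every $u \in U$, there exists $k(u) \in \{1, \ldots, N\}$ such that

\[
\big\| \ell_0(u) 1_{\Omega_\varepsilon} - l_{k(u)} 1_{\Omega_\varepsilon} \big\|_\infty \le \varepsilon
\]

(and $u \mapsto k(u)$ can clearly be chosen to be measurable). It follows that

\[
\begin{aligned}
\operatorname*{ess\,sup}_{u\in\mathbb{U}_0} \big| E[\ell_0(u) \mid \mathcal{Y}_{-\infty,0} \vee T^{-n}\mathcal{X}] - \ell_0(u) \big|
\le{}& \max_{1\le k\le N} \big| E[l_k 1_{\Omega_\varepsilon} \mid \mathcal{Y}_{-\infty,0} \vee T^{-n}\mathcal{X}] - l_k 1_{\Omega_\varepsilon} \big| \\
&+ 2\varepsilon + E[\Lambda 1_{\Omega_\varepsilon^c} \mid \mathcal{Y}_{-\infty,0} \vee T^{-n}\mathcal{X}] + \Lambda 1_{\Omega_\varepsilon^c}.
\end{aligned}
\]

As $\mathcal{X}$ is generating, the martingale convergence theorem gives

\[
\limsup_{n\to\infty} \Big\| \operatorname*{ess\,sup}_{u\in\mathbb{U}_0} \big| E[\ell_0(u) \mid \mathcal{Y}_{-\infty,0} \vee T^{-n}\mathcal{X}] - \ell_0(u) \big| \Big\|_1 \le 2\varepsilon + E[2\Lambda 1_{\Omega_\varepsilon^c}].
\]

Letting $\varepsilon \downarrow 0$ yields the first line of Assumption 2. The remaining statements of
Assumption 2 follow by an essentially identical argument. □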

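Lemma 3.3. Suppose that the following conditional mixing assumption holds:

\[
\lim_{M\to\infty} \limsup_{T\to\infty} \bigg\| \frac{1}{T} \sum_{k=1}^{T} \Big| E\big[\{\bar\ell_0^M(u) \circ T^{-k}\}\, \bar\ell_0^M(u') \,\big|\, \mathcal{Y}_{-\infty,0}\big] \Big| \bigg\|_1 = 0 \quad\text{for every } u,u' \in U.
\]

If the loss function $\ell$ is equimeasurable, then the assumption of Theorem 2.13 holds.

Proof. The proof is very similar to that of Lemma 3.2 and is therefore omitted. □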

As an immediate consequence of these lemmas, we have: 

Corollary 3.4. The conclusions of Theorems 1.7 and 1.8 remain in force if the assumption
that $U$ is finite is replaced by the assumption that $\ell$ is equimeasurable.

We now give a simple condition for equimeasurability that suffices in many cases.
It is closely related to a result of Mokobodzki (cf. [9, Theorem IX.19]).

Lemma 3.5. Suppose that $U$ is a compact metric space and that $u \mapsto \ell(u,\omega)$ is
continuous for a.e. $\omega \in \Omega$. Then $\ell$ is equimeasurable.

Proof. As $U$ is a compact metric space (with metric $d$), it is certainly separable. Let
$U_0 \subset U$ be a countable dense set, and define the functions

\[
b_n = \sup_{u,u'\in U_0 \,:\, d(u,u') \le n^{-1}} |\ell_0(u) - \ell_0(u')|.
\]

$b_n$ is measurable, as it is the supremum of countably many random variables. Moreover,
for almost every $\omega$, the function $u \mapsto \ell(u,\omega)$ is uniformly continuous (being
continuous on a compact metric space). Therefore, $b_n \downarrow 0$ a.s. as $n \to \infty$.
By Egorov's theorem, there exists for every $\varepsilon > 0$ a set $\Omega_\varepsilon$ with $P[\Omega_\varepsilon] \ge 1 - \varepsilon$
such that $\|b_n 1_{\Omega_\varepsilon}\|_\infty \downarrow 0$. We claim that $\{\ell_0(u) 1_{\Omega_\varepsilon} : u \in U\}$ is compact in $L^\infty$. Indeed,
for any sequence $(u_n)_{n\ge 1} \subset U$ we may choose a subsequence $(u_{n_k})_{k\ge 1}$ that converges
to $u_\infty \in U$. Then for every $r$, we have $|\ell_0(u_{n_k}) - \ell_0(u_\infty)| \le b_r$ for all $k$ sufficiently
large, and therefore $\|\ell_0(u_{n_k}) 1_{\Omega_\varepsilon} - \ell_0(u_\infty) 1_{\Omega_\varepsilon}\|_\infty \to 0$. □

Let us give two standard examples of decision problems (cf. [1, 24]).

Example 3.6 ($\ell_p$-prediction). Consider the stochastic process setting $(X,Y)$, and let
$f$ be a bounded function. The aim is, at each time $k$, to choose a predictor $u_k$ of
$f(X_{k+1})$ on the basis of the observation history $Y_0, \ldots, Y_k$. We aim to minimize the
pathwise time-average $\ell_p$-prediction loss $\frac{1}{T}\sum_{k=1}^{T} |u_k - f(X_{k+1})|^p$ ($p \ge 1$). This is
a particular decision problem with partial information, where the loss function is
given by $\ell_0(u) = |u - f(X_1)|^p$ and the decision space is $U = [\inf_x f(x), \sup_x f(x)]$. It
is immediate that $\ell$ is equimeasurable by Lemma 3.5.
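
To make the total boundedness in this example concrete, here is a minimal sketch under stated assumptions: we write $[a,b]$ for the decision interval, treat the value $y = f(X_1)$ as ranging over $[a,b]$, and use the elementary Lipschitz bound $\big||u-y|^p - |u'-y|^p\big| \le p\,(b-a)^{p-1}|u-u'|$. The helper names `epsilon_net` and `sup_distance` are ours and do not appear in the paper.

```python
import math


def loss(u, y, p):
    """The l_p prediction loss |u - y|^p."""
    return abs(u - y) ** p


def epsilon_net(a, b, p, eps):
    """Finitely many decisions in [a, b] whose losses form an eps-net in sup norm."""
    diam = b - a
    if diam == 0:
        return [a]
    lipschitz = p * diam ** (p - 1)  # bound on |d/du |u - y|^p| for u, y in [a, b]
    cells = max(1, math.ceil(lipschitz * diam / (2 * eps)))
    width = diam / cells
    # Midpoints of equal cells: every u in [a, b] is within width / 2 of one of them.
    return [a + (i + 0.5) * width for i in range(cells)]


def sup_distance(u, v, a, b, p, grid=200):
    """Sup over a grid of outcomes y in [a, b] of |loss(u, y) - loss(v, y)|."""
    ys = [a + (b - a) * j / grid for j in range(grid + 1)]
    return max(abs(loss(u, y, p) - loss(v, y, p)) for y in ys)


net = epsilon_net(0.0, 1.0, 2.0, 0.05)
worst = max(min(sup_distance(k / 97, v, 0.0, 1.0, 2.0) for v in net)
            for k in range(98))
print(len(net), worst)
```

In this bounded, deterministic-in-$u$ situation no exceptional set is needed: the net works on all of $\Omega$, which is a stronger statement than Definition 3.1 requires.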

Example 3.7 (Log-optimal portfolios). Consider a market with $d$ securities (e.g., $d - 1$
stocks and one bond) whose returns in day $k$ are given by the random variable $X_k$
with values in $\mathbb{R}^d_+$. The decision space $U = \{p \in \mathbb{R}^d_+ : \sum_{i=1}^{d} p^i = 1\}$ is the simplex:
$u_k^i$ represents the fraction of wealth invested in the $i$th security in day $k$. The total
wealth at time $T$ is therefore given by $\prod_{k=1}^{T} \langle u_k, X_k \rangle$. We only have access to partial
information in day $k$, e.g., from news reports. We aim to choose an investment
strategy on the basis of the available information that maximizes the wealth, or,
equivalently, its growth $\frac{1}{T}\sum_{k=1}^{T} \log\langle u_k, X_k \rangle$. This corresponds to a decision problem
with partial information for the loss function $\ell_0(u) = -\log\langle u, X_0 \rangle$.

In order for the loss to be dominated in $L^1$, we impose the mild assumption
$E[\Lambda] < \infty$ with $\Lambda = \sum_{i=1}^{d} |\log X_0^i|$. We claim that the loss $\ell$ is then also equimeasurable.
Indeed, as $E[\Lambda] < \infty$, the returns must satisfy $X_0^i > 0$ a.s. for every $i$. Therefore,
equimeasurability follows directly from Lemma 3.5.
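
The quantities in this example are easy to compute. The sketch below is illustrative only: the return vectors and portfolios are made-up numbers, and the function names are ours. It evaluates the loss $-\log\langle u, x\rangle$, checks the domination $|\log\langle u, x\rangle| \le \sum_i |\log x^i|$ (which holds because $\langle u, x\rangle$ is a weighted average of the $x^i$), and confirms that the time-average growth is the logarithm of the wealth divided by $T$.

```python
import math


def log_loss(u, x):
    """Loss -log<u, x> of a portfolio u on the simplex, given returns x > 0."""
    return -math.log(sum(ui * xi for ui, xi in zip(u, x)))


def dominating(x):
    """The dominating variable: sum of |log x^i| over the securities."""
    return sum(abs(math.log(xi)) for xi in x)


def average_growth(portfolios, returns):
    """Time-average growth (1/T) * sum_k log<u_k, x_k>."""
    total = sum(-log_loss(u, x) for u, x in zip(portfolios, returns))
    return total / len(returns)


# Hypothetical data: two stocks and one bond over three days.
returns = [(1.02, 0.97, 1.0), (0.95, 1.10, 1.0), (1.05, 1.01, 1.0)]
portfolios = [(0.5, 0.3, 0.2), (0.2, 0.6, 0.2), (1.0, 0.0, 0.0)]

wealth = math.prod(sum(ui * xi for ui, xi in zip(u, x))
                   for u, x in zip(portfolios, returns))
print(wealth, math.exp(len(returns) * average_growth(portfolios, returns)))
```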

As we have seen above, equimeasurability follows easily when the loss function
possesses some mild pointwise continuity properties. However, there are situations
when this may not be the case. In particular, suppose that $\ell(u,\omega)$ only takes the
values 0 and 1, that is, our decisions are sets (as may be the case, for example, in
predicting the shape of an oil spill or in sequential classification problems). In such a
case, equimeasurability will rarely hold, and it is of interest to investigate alternative
complexity assumptions. As we will presently explain, equimeasurability is almost
necessary to obtain a universal complexity assumption for Corollary 2.11; however,
in the setting of Theorem 2.6, the assumption can be weakened considerably.

The simplicity of the equimeasurability assumption hides the fact that there are
two distinct uniformity assumptions in Corollary 2.11: we require uniform convergence
of both martingales and reverse martingales, which are quite distinct phenomena
(cf. [18, 17]). The uniform convergence of martingales can be restrictive.

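Example 3.8 (Uniform martingale convergence). Let $(\mathcal{G}_n)_{n\ge 1}$ be a filtration such
that each $\mathcal{G}_n = \sigma\{\pi_n\}$ is generated by a finite measurable partition $\pi_n$ of the probability
space $(\Omega, \mathcal{B}, P)$. Let $L : \mathbb{N} \times \Omega \to \mathbb{R}$ be a bounded function such that $L(u,\cdot)$ is
$\mathcal{G}_\infty$-measurable for every $u \in \mathbb{N}$. Then $E[L(u,\cdot) \mid \mathcal{G}_n] \to L(u,\cdot)$ a.s. for every $u$. We
claim that if this martingale convergence is in fact uniform, that is,

\[
\sup_{u\in\mathbb{N}} \big| E[L(u,\cdot) \mid \mathcal{G}_n] - L(u,\cdot) \big| \xrightarrow{n\to\infty} 0 \quad\text{in } L^1,
\]

then $L$ must necessarily be equimeasurable. To see this, let us first extract a subsequence
$n_k \uparrow \infty$ along which the uniform martingale convergence holds a.s. Fix $\varepsilon > 0$.
By Egorov's theorem, there exists a set $\Omega_\varepsilon$ with $P[\Omega_\varepsilon] \ge 1 - \varepsilon$ such that

\[
\sup_{u\in\mathbb{N}} \big\| E[L(u,\cdot) \mid \mathcal{G}_{n_k}] 1_{\Omega_\varepsilon} - L(u,\cdot) 1_{\Omega_\varepsilon} \big\|_\infty \xrightarrow{k\to\infty} 0.
\]

Therefore, for every $\alpha > 0$, there exists $k$ such that

\[
\sup_{u\in\mathbb{N}} \big\| \alpha \lfloor \alpha^{-1} E[L(u,\cdot) \mid \mathcal{G}_{n_k}] \rfloor 1_{\Omega_\varepsilon} - L(u,\cdot) 1_{\Omega_\varepsilon} \big\|_\infty \le 2\alpha.
\]

But as $\mathcal{G}_n$ is finitely generated, we can write

\[
E[L(u,\cdot) \mid \mathcal{G}_n] 1_{\Omega_\varepsilon} = \sum_{P\in\pi_n} L_{n,u,P}\, 1_{P\cap\Omega_\varepsilon},
\]

with $|L_{n,u,P}| \le \|L\|_\infty$ for all $n,u,P$. In particular, $\{\alpha \lfloor \alpha^{-1} E[L(u,\cdot) \mid \mathcal{G}_n] \rfloor 1_{\Omega_\varepsilon} : u \in \mathbb{N}\}$
is a finite family of random variables for every $n$. We have therefore established that
the family $\{L(u,\cdot) 1_{\Omega_\varepsilon} : u \in \mathbb{N}\}$ is totally bounded in $L^\infty$.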

In the context of Corollary 2.11, the previous example can be interpreted as follows.
Suppose that the observations are finite-valued, that is, $\mathcal{Y}$ is a finitely generated
$\sigma$-field. Let us suppose, for simplicity, that the decision space $U$ is countable (the
same conclusion holds for general $U$ modulo some measurability issues). Then, if
the third line of Assumption 2 in Corollary 2.11 holds, the conditioned loss
$E[\ell_0(u) \mid \mathcal{Y}_{-\infty,0}]$ is necessarily equimeasurable. While it is possible that the conditioned
loss is equimeasurable even when the loss $\ell$ is not (e.g., in the case of blind
decisions), this is somewhat unlikely to be the case given a nontrivial observation
structure. Therefore, it appears that equimeasurability is almost necessary to obtain
universal complexity assumptions in the setting of Corollary 2.11.

The situation is much better in the setting of Theorem 2.6, however. While the
first line of Assumption 2 in Theorem 2.6 is still a uniform martingale convergence
property, the $\sigma$-field $\mathcal{X}$ cannot be finitely generated except in trivial cases. In fact, in
many cases the loss $\ell$ will be $T^{-n}\mathcal{X}$-measurable for some $n < \infty$, in which case the
first line of Assumption 2 is automatically satisfied (in particular, in the stochastic
process setting, this will be the case for finitary loss $\ell_0(u) = l(u, X_{n_1}, \ldots, X_{n_k})$ if we
choose $\mathcal{X} = \sigma\{X_k : k \le 0\}$). The remainder of Assumption 2 is a uniform reverse
martingale convergence property, which holds under much weaker assumptions.

Definition 3.9. The loss $\ell : U \times \Omega \to \mathbb{R}$ on $(\Omega, \mathcal{B})$ is said to be universally bracketing
if for every probability measure $P$ and $\varepsilon, M > 0$, the family $\{\ell_0(u) 1_{\Lambda \le M} : u \in U\}$
can be covered by finitely many brackets $\{f : g \le f \le h\}$ with $\|g - h\|_{L^1(P)} \le \varepsilon$.

Lemma 3.10. Let $(\Omega, \mathcal{B})$ be a standard space, and let $\mathcal{X}$, $\mathcal{Y}$ be countably generated.
Suppose the loss $\ell$ is universally bracketing and finitary (that is, for some $n \in \mathbb{Z}$,
$\ell_0(u)$ is $T^{-n}\mathcal{X}$-measurable for all $u \in U$). Then Assumption 2 of Theorem 2.6 holds.

Proof. The finitary assumption trivially implies the first line of Assumption 2. The
second line follows along the lines of the proof of [17, Corollary 1.4(2⇒7)]. □

To show that universal bracketing can be much weaker than equimeasurability, 
we give a simple example in the context of set estimation. 

Example 3.11 (Confidence intervals). Consider the stochastic process setting $(X,Y)$
where $X$ takes values in the set $[-1,1]$, and fix $\varepsilon > 0$. We would like to pin down the
value of $X_k$ up to precision $\varepsilon$; that is, we want to choose $u_k \in [-1,1]$ as a function
of the observations $Y_0, \ldots, Y_k$ such that $u_k \le X_k < u_k + \varepsilon$ as often as possible. This is
a partial information decision problem with loss function $\ell_0(u) = 1_{\mathbb{R}\setminus[u,u+\varepsilon[}(X_0)$.

The proof of the universal bracketing property of $\ell$ is standard. Given $P$ and
$\varepsilon > 0$, we choose $-1 = a_0 < a_1 < \cdots < a_n = 1$ (for some finite $n$) in such a way
that $P[a_i < X_0 < a_{i+1}] \le \varepsilon$ for all $i$ (note that every atom of $X_0$ with probability
greater than $\varepsilon$ is one of the values $a_i$). Put each function $\ell_0(u)$ such that $u = a_i$
or $u + \varepsilon = a_i$ for some $i$ in its own bracket, and consider the additional brackets
$\{f : 1_{\mathbb{R}\setminus]a_{i-1},a_{j+1}[} \le f \le 1_{\mathbb{R}\setminus[a_i,a_j]}\}$ for all $1 \le i \le j < n$. Then evidently each of the
brackets has diameter not exceeding $2\varepsilon$, and for every $u \in U$ the function $\ell_0(u)$ is
included in one of the brackets thus constructed.

On the other hand, whenever the law of $X_0$ is not purely atomic, the loss $\ell$ cannot
be equimeasurable. Indeed, as $\|\ell_0(u) 1_{\Omega_\varepsilon} - \ell_0(u') 1_{\Omega_\varepsilon}\|_\infty = 1$ whenever $\ell_0(u) 1_{\Omega_\varepsilon} \ne \ell_0(u') 1_{\Omega_\varepsilon}$,
it is impossible for $\{\ell_0(u) 1_{\Omega_\varepsilon} : u \in U\}$ to be totally bounded in $L^\infty$ for
any infinite set $\Omega_\varepsilon$ (and therefore for any set of sufficiently large measure).

In [17] a detailed characterization is given of the universal bracketing property.
In particular, it is shown that a uniformly bounded, separable loss $\ell$ on a standard
measurable space is universally bracketing if and only if $\{\ell_0(u) : u \in U\}$ is a universal
Glivenko-Cantelli class, that is, a class of functions for which the law of
large numbers always holds uniformly. Many useful methods have been developed
in empirical process theory to verify this property, cf. [10, 29]. In particular, for a
separable $\{0,1\}$-valued loss, a very useful sufficient condition is that $\{\ell_0(u) : u \in U\}$
is a Vapnik-Chervonenkis class. We refer to [17, 10, 29] for further details.



3.2 Conditional absolute regularity 

In the previous section, we have developed universal complexity assumptions that 
are applicable regardless of other details of the model. In the present section, we 



5 The pointwise separability assumption in |17| Corollary 1.4(2=>7)] is not needed here, as the 
essential supremum can be reduced to a countable supremum as in the proof of Lemma l2~31 
will in some sense take the opposite approach: we will develop a sufficient condition
for a stronger version of the conditional $K$-property (in the stochastic process
setting) under which no complexity assumptions are needed. This shows that there
is a tradeoff between mixing and complexity: if the mixing assumption is strengthened,
then the complexity assumption can be weakened. An additional advantage of
the sufficient condition to be presented is that it is in practice one of the most easily
verifiable conditions that ensures the conditional $K$-property.

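In the remainder of this section, we will work in the stochastic process setting.
Let $(X,Y)$ be a stationary ergodic process taking values in the Polish space $E \times F$.
We define $\mathcal{Y}_{n,m} = \sigma\{Y_k : n \le k \le m\}$ and $\mathcal{X}_{n,m} = \sigma\{X_k : n \le k \le m\}$ for $n \le m$, and
we consider the observation and generating fields $\mathcal{Y} = \sigma\{Y_0\}$, $\mathcal{X} = \mathcal{X}_{-\infty,0} \vee \mathcal{Y}_{-\infty,0}$.
In this setting, the conditional $K$-property relative to $\mathcal{Y}_{-\infty,0}$ reduces to

\[
\bigcap_{k=1}^{\infty} \big(\mathcal{Y}_{-\infty,0} \vee \mathcal{X}_{-\infty,-k}\big) = \mathcal{Y}_{-\infty,0} \mod P.
\]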

If $\mathcal{Y}$ is trivial (that is, the observations $Y$ are noninformative), this reduces to the
statement that $X$ has a trivial past tail $\sigma$-field, that is, $X$ is regular (or purely nondeterministic)
in the sense of Kolmogorov. This property is often fairly easy to check:
for example, any Markov chain whose law converges weakly to a unique invariant
measure is regular (cf. [28, Prop. 3]). When $\mathcal{Y}$ is nontrivial, the conditional
$K$-property is generally not so easy to check, however. We therefore give a condition,
arising from filtering theory [27], that allows us to deduce conditional mixing
properties from their more easily verifiable unconditional counterparts.

We will require two assumptions. The first assumption states that the pair $(X,Y)$
is absolutely regular in the sense of Volkonskii and Rozanov [30] (this property is
also known as $\beta$-mixing). Absolute regularity is a strengthening of the regularity
property; assuming regularity of $(X,Y)$ is not sufficient for what follows [16]. Many
techniques have been developed to verify the absolute regularity property; for example,
any Harris recurrent and aperiodic Markov chain is absolutely regular [22].

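Definition 3.12. The process $(X,Y)$ is said to be absolutely regular if

\[
\big\| P[(X_k,Y_k)_{k\ge n} \in \cdot \mid \mathcal{X}_{-\infty,0} \vee \mathcal{Y}_{-\infty,0}] - P[(X_k,Y_k)_{k\ge n} \in \cdot\,] \big\|_{TV} \xrightarrow{n\to\infty} 0 \quad\text{in } L^1.
\]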
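
For intuition, the quantity in this definition can be computed in closed form in a simple special case. The sketch below is an illustration under stated assumptions, not the paper's method: it treats the pair as a single stationary finite-state Markov chain with transition matrix `P` and stationary law `pi` (both hypothetical inputs), uses the convention that total variation is half the $\ell_1$ distance, and relies on the Markov property, by which the total variation distance between the conditional and unconditional laws of the future path equals that between the $n$-step transition law and the stationary law. The $L^1$ norm in the definition then becomes an average over the stationary law.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as nested lists."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def beta_coefficient(P, pi, n):
    """Average over pi of the TV distance between the n-step law P^n(i, .) and pi."""
    size = len(P)
    # Start from the identity matrix and multiply by P a total of n times.
    Pn = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        Pn = mat_mul(Pn, P)
    return sum(pi[i] * 0.5 * sum(abs(Pn[i][j] - pi[j]) for j in range(size))
               for i in range(size))


# Hypothetical two-state chain; pi = (0.75, 0.25) is its stationary law.
P = [[0.9, 0.1], [0.3, 0.7]]
pi = [0.75, 0.25]
for n in range(1, 6):
    print(n, beta_coefficient(P, pi, n))
```

For a two-state chain with off-diagonal entries $a$ and $b$, the coefficient equals $2\pi_0\pi_1|1-a-b|^n$, so it decays geometrically whenever the chain is aperiodic and irreducible.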

By itself, however, absolute regularity of $(X,Y)$ is not sufficient for the conditional
$K$-property, as can be seen in Example 1.3. In this example, the relation
between the processes $X$ and $Y$ is very singular, so that things go wrong when we
condition. The following nondegeneracy assumption rules out this possibility.

Definition 3.13. The process $(X,Y)$ is said to be nondegenerate if

\[
P[Y_1, \ldots, Y_m \in \cdot \mid \mathcal{Z}_{-\infty,0} \vee \mathcal{Z}_{m+1,\infty}] \sim P[Y_1, \ldots, Y_m \in \cdot \mid \mathcal{Y}_{-\infty,0} \vee \mathcal{Y}_{m+1,\infty}] \quad\text{a.s.}
\]

for every $1 \le m < \infty$, where $\mathcal{Z}_{n,m} := \mathcal{X}_{n,m} \vee \mathcal{Y}_{n,m}$.






Ramon van Handel 



The nondegeneracy assumption ensures that the null sets of the law of the observations
$Y$ do not depend too much on the unobserved process $X$. The assumption is
often easily verified. For example, if $Y_k = h(X_k) + \eta_k$ where $\eta_k$ is an i.i.d. sequence
of random variables with strictly positive density, then the conditional distributions
in Definition 3.13 have strictly positive densities and are therefore equivalent a.s.

Theorem 3.14 ([27]⁶). If $(X,Y)$ is absolutely regular and nondegenerate, then

$$\bigcap_{k=1}^{\infty}\big(\mathcal{Y}_{-\infty,0}\vee\mathcal{X}_{-\infty,-k}\big) = \mathcal{Y}_{-\infty,0} \quad\text{mod } P.$$

Theorem 3.14 provides a practical method to check the conditional $K$-property. However, the proof of Theorem 3.14 actually yields a much stronger statement. It is shown in [27, Theorem 3.5] that if $(X,Y)$ is absolutely regular and nondegenerate, then $X$ is conditionally absolutely regular relative to $\mathcal{Y}_{-\infty,\infty}$ in the sense that

$$\big\| P[(X_k)_{k\ge n}\in\cdot \mid \mathcal{X}_{-\infty,0}\vee\mathcal{Y}_{-\infty,\infty}] - P[(X_k)_{k\ge n}\in\cdot \mid \mathcal{Y}_{-\infty,\infty}] \big\|_{\mathrm{TV}} \xrightarrow{n\to\infty} 0 \quad\text{in } L^1,$$

and, moreover,

$$P[(X_k)_{k\le 0}\in\cdot \mid \mathcal{Y}_{-\infty,0}] \sim P[(X_k)_{k\le 0}\in\cdot \mid \mathcal{Y}_{-\infty,\infty}] \quad\text{a.s.}$$

From these properties, we can deduce the following result.

Theorem 3.15. In the setting of the present section, suppose that $(X,Y)$ is absolutely regular and nondegenerate, and consider a loss function of the form $\ell_0(u) = l(u,X_0)$. Then the conclusions of Theorem 2.6 hold.

The key point about Theorem 3.15 is that no complexity assumption is imposed: the loss function $l(u,x)$ may be an arbitrary measurable function (as long as it is dominated in $L^1$ in accordance with our standing assumption). The explanation for this is that the conditional absolute regularity property is so strong that the regular conditional probabilities $P[X_0\in\cdot \mid \mathcal{Y}_{-\infty,\infty}\vee\mathcal{X}_{-\infty,-n}]$ converge in total variation. Therefore, the corresponding reverse martingales converge uniformly over any dominated family of measurable functions. The strength of the conditional mixing property therefore eliminates the need for any additional complexity assumptions. In contrast, we may certainly have pathwise optimal strategies when absolute regularity fails, but then a complexity assumption is essential (cf. Example 2.8).

The proof of Theorem 3.15 will be given in Section 4. The proof is a straightforward adaptation of that of Theorem 2.6; unfortunately, the fact that the conditional absolute regularity property is relative to $\mathcal{Y}_{-\infty,\infty}$ rather than $\mathcal{Y}_{-\infty,0}$ complicates a direct verification of the assumptions of Theorem 2.6 (while this should be possible along the lines of [27], we will follow the simpler route here). The results of [27] could also be used to obtain the conclusion of Corollary 2.11 in the setting of Theorem 3.15 under somewhat stronger nondegeneracy assumptions.

6 Some of the statements in [27] are time-reversed as compared to their counterparts stated here. However, as both the absolute regularity and the nondegeneracy assumptions are invariant under time reversal (cf. [30] for the former; the latter is trivial), the present statements follow immediately.



3.3 Hidden Markov models and nonlinear filters 

The goal of the present section is to explore some implications of our results for filtering theory. For simplicity of exposition, we will restrict attention to the classical setting of (general state space) hidden Markov models (see, e.g., [4]).

We adopt the stochastic process setting and notations of the previous section. In addition, we assume that $(X,Y)$ is a hidden Markov model, that is, a Markov chain whose transition kernel can be factored as $P(x,y,dx',dy') = P(x,dx')\,\Phi(x',dy')$. This implies that the process $X$ is a Markov chain in its own right, and that the observations $Y$ are conditionally independent given $X$. In the following, we will assume that the observation kernel $\Phi$ has a density, that is, $\Phi(x,dy) = g(x,y)\,\varphi(dy)$ for some measurable function $g$ and reference measure $\varphi$.

A fundamental object in this theory is the nonlinear filter $\Pi_k$, defined as

$$\Pi_k := P[X_k\in\cdot \mid Y_0,\dots,Y_k].$$

The measure-valued process $\Pi = (\Pi_k)_{k\ge 0}$ is itself a (nonstationary) Markov chain [16] with its own transition kernel. To study the stationary behavior of the filter, which is of substantial interest in applications (see, for example, [3] and the references therein), one must understand the relationship between the ergodic properties of $X$ and $\Pi$. The following result, proved in [16], is essentially due to Kunita [20].

Theorem 3.16. Suppose that the transition kernel $P$ possesses a unique invariant measure (that is, $X$ is uniquely ergodic). Then the filter transition kernel possesses a unique invariant measure (that is, $\Pi$ is uniquely ergodic) if and only if

$$\bigcap_{k=1}^{\infty}\big(\mathcal{Y}_{-\infty,0}\vee\mathcal{X}_{-\infty,-k}\big) = \mathcal{Y}_{-\infty,0} \quad\text{mod } P.$$

Evidently, ergodicity of the filter is closely related to the conditional $K$-property. We will exploit this fact to prove a new optimality property of nonlinear filters.

The usual interpretation of the filter $\Pi_k$ is that one aims to track the current location $X_k$ of the unobserved process on the basis of the observation history $Y_0,\dots,Y_k$. By the elementary property of conditional expectations, $\Pi_k(f)$ provides, for any bounded test function $f$, an optimal mean-square error estimate of $f(X_k)$:

$$E[\{f(X_k)-\Pi_k(f)\}^2] \le E[\{f(X_k)-f_k(Y_0,\dots,Y_k)\}^2] \quad\text{for any measurable } f_k.$$

This interpretation may not be satisfying, however, if only one sample path of the observations is available (recall Examples 1.2 and 1.3): one would rather show that

$$\liminf_{T\to\infty}\Big\{\frac{1}{T}\sum_{k=1}^{T}\{f(X_k)-f_k(Y_0,\dots,Y_k)\}^2 - \frac{1}{T}\sum_{k=1}^{T}\{f(X_k)-\Pi_k(f)\}^2\Big\} \ge 0 \quad\text{a.s.}$$

for any alternative sequence of estimators $(f_k)_{k\ge 0}$. If this property holds for any bounded test function $f$, the filter will be said to be pathwise optimal.

Corollary 3.17. Suppose that the filtering process $\Pi$ is uniquely ergodic. Then the filter is both mean-square optimal and pathwise optimal.

Proof. Note that the filter $\Pi_k(f)$ is the mean-optimal policy for the partial information decision problem with loss $\ell_0(u) = \{f(X_0)-u\}^2$. As the latter is equimeasurable, the result follows directly from Theorem 3.16 and Corollary 2.11. □
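The contrast between mean-square and pathwise optimality can be illustrated on a single simulated path. The following Python sketch is only an illustration under assumed parameters: the two-state chain, the values `stay` and `flip`, the seed, and all function names are hypothetical and do not come from the paper. It computes the filter recursively for $f(x) = x$ and compares its time-average squared error with that of the alternative estimator $f_k = Y_k$.

```python
import random

def simulate_hmm(T, stay=0.9, flip=0.1, seed=0):
    """Two-state chain X_k in {0,1}; Y_k equals X_k flipped with probability `flip`."""
    rng = random.Random(seed)
    x = rng.randint(0, 1)  # the stationary law is uniform by symmetry
    xs, ys = [], []
    for _ in range(T):
        xs.append(x)
        ys.append(x if rng.random() > flip else 1 - x)
        if rng.random() > stay:
            x = 1 - x
    return xs, ys

def filter_means(ys, stay=0.9, flip=0.1):
    """Recursively compute Pi_k({1}) = P[X_k = 1 | Y_0, ..., Y_k]."""
    means = []
    p1 = 0.5  # prior P[X_0 = 1]
    for k, y in enumerate(ys):
        if k > 0:  # prediction step: push the previous filter through the chain
            p1 = stay * p1 + (1 - stay) * (1 - p1)
        # correction step: Bayes rule with the observation likelihood
        like1 = 1 - flip if y == 1 else flip
        like0 = flip if y == 1 else 1 - flip
        p1 = like1 * p1 / (like1 * p1 + like0 * (1 - p1))
        means.append(p1)
    return means

def time_average_sq_error(xs, estimates):
    """(1/T) * sum_k (X_k - estimate_k)^2 along one sample path."""
    return sum((x - e) ** 2 for x, e in zip(xs, estimates)) / len(xs)

xs, ys = simulate_hmm(20000)
filter_err = time_average_sq_error(xs, filter_means(ys))
naive_err = time_average_sq_error(xs, ys)  # alternative estimator f_k = Y_k
```

On this path, `filter_err` comes out below `naive_err`, in line with the pathwise optimality property; a single simulation is of course no substitute for the corollary.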

The interaction between our main results and the ergodic theory of nonlinear filters is therefore twofold. On the one hand, our main results imply that ergodic nonlinear filters are always pathwise optimal. Conversely, Theorem 3.16 shows that ergodicity of the filter is a sufficient condition for our main results to hold in the context of hidden Markov models with equimeasurable loss. This provides another route to establishing the conditional $K$-property: the filtering literature provides a
variety of methods to verify ergodicity of the filter lTl4l l5ll7l l27l . It should be noted, 
however, that ergodicity of the filter is not necessary for the conditional $K$-property
to hold, even in the setting of hidden Markov models. 

Example 3.18. Consider the hidden Markov model $(X,Y)$ where $X$ is the stationary Markov chain such that $X_0\sim\mathrm{Uniform}([0,1])$ and $X_{k+1} = 2X_k \bmod 1$, $Y_k = 0$ for all $k\in\mathbb{Z}$ (that is, we have noninformative observations). Clearly the tail $\sigma$-field $\bigcap_n \mathcal{X}_{-\infty,n}$ is nontrivial, and thus the filter fails to be ergodic by Theorem 3.16. Nonetheless, we claim that the conditional $K$-property holds, so that our main results apply for any equimeasurable loss; in particular, the filter is pathwise optimal.

The key point is that, even in the hidden Markov model setting, one need not choose the "canonical" generating $\sigma$-field $\mathcal{X} = \mathcal{X}_{-\infty,0}$ in Definition 1.6. In the present example, we choose instead $\mathcal{X} = \sigma\{1_{X_k>1/2} : k\le 0\}$. To verify the conditional $K$-property, note that $(1_{X_k>1/2})_{k\in\mathbb{Z}}$ are i.i.d. Bernoulli(1/2) random variables and

$$X_k = \sum_{l=0}^{\infty} 2^{-l-1}\,1_{X_{k+l}>1/2} \quad\text{a.s. for all } k\in\mathbb{Z}.$$

Thus $\mathcal{X}\subset T^{-1}\mathcal{X}$ by construction, $\bigvee_k T^{-k}\mathcal{X} = \sigma\{X_n : n\in\mathbb{Z}\}$ is a generating $\sigma$-field, and $\bigcap_k T^{k}\mathcal{X}$ is trivial by the Kolmogorov zero-one law.
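The binary-expansion identity used in Example 3.18 can be checked numerically. The following Python sketch is illustrative only: the helper names and the starting point are hypothetical, exact rational arithmetic is used because floating-point numbers collapse to zero under repeated doubling, and the infinite series is truncated after finitely many terms.

```python
from fractions import Fraction

def doubling_map(x):
    """One step of X_{k+1} = 2 X_k mod 1 on exact rationals."""
    y = 2 * x
    return y - 1 if y >= 1 else y

def digit(x):
    """Indicator 1_{x > 1/2}, the binary observable generating the sigma-field."""
    return 1 if x > Fraction(1, 2) else 0

def reconstruct(x0, terms):
    """Partial sum of sum_{l>=0} 2^{-l-1} 1_{X_l > 1/2} along the orbit of x0."""
    total, x = Fraction(0), x0
    for l in range(terms):
        total += Fraction(digit(x), 2 ** (l + 1))
        x = doubling_map(x)
    return total

x0 = Fraction(314159, 1000003)  # arbitrary starting point (odd denominator, so 1/2 is never hit)
approx = reconstruct(x0, 40)
error = abs(x0 - approx)        # truncation error, at most 2^{-40}
```

The truncation error after 40 terms is bounded by $2^{-40}$, since the remainder of the series is $2^{-40}$ times the current orbit point.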

Let us now consider the decision problem in the setting of a hidden Markov model with equimeasurable loss function $\ell_0(u) = l(u,X_0)$. If the filter is ergodic, then Corollary 3.4 ensures that the mean-optimal strategy $\tilde{\mathbf{u}}$ is pathwise optimal. In this setting, the mean-optimal strategy can be expressed in terms of the filter:

$$\tilde u_k = \arg\min_{u\in U} E[l(u,X_k)\mid Y_0,\dots,Y_k] = \arg\min_{u\in U}\int l(u,x)\,\Pi_k(dx).$$

When $X_k$ takes values in a finite set $E = \{1,\dots,d\}$, the filter can be recursively computed in a straightforward manner [4]. In this case, the mean-optimal strategy $\tilde{\mathbf{u}}$ can be implemented directly. On the other hand, when $E$ is a continuous space, the conditional measure $\Pi_k$ is an infinite-dimensional object which cannot be computed exactly except in special cases. However, $\Pi_k$ can often be approximated very efficiently by recursive Monte Carlo approximations $\Pi_k^N = \frac{1}{N}\sum_{i=1}^{N}\delta_{Z_k^N(i)}$, known as particle filters [4], that converge to the true filter $\Pi_k$ as the number of particles $N\to\infty$. This suggests approximating the mean-optimal strategy $\tilde{\mathbf{u}}$ by

$$\tilde u_k \approx \tilde u_k^N := \arg\min_{u\in U}\int l(u,x)\,\Pi_k^N(dx) = \arg\min_{u\in U}\frac{1}{N}\sum_{i=1}^{N} l(u,Z_k^N(i)).$$
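The last display amounts to minimizing an empirical risk over the particle locations. A minimal Python sketch of this decision step follows, under stated assumptions: the decision set is taken to be finite, the particle locations are hypothetical numbers rather than the output of an actual particle filter, and the function names are illustrative only.

```python
def empirical_risk(u, particles, loss):
    """(1/N) * sum_i loss(u, Z(i)): the integral of loss(u, .) against the particle measure."""
    return sum(loss(u, z) for z in particles) / len(particles)

def approx_mean_optimal(decisions, particles, loss):
    """Minimize the empirical risk over a finite decision set."""
    return min(decisions, key=lambda u: empirical_risk(u, particles, loss))

def squared_loss(u, x):
    return (u - x) ** 2

particles = [0.0, 0.0, 1.0, 1.0, 1.0]  # hypothetical particle locations Z(i)
decisions = [0.0, 0.5, 1.0]            # hypothetical finite decision set U
u_hat = approx_mean_optimal(decisions, particles, squared_loss)
```

With these particles the empirical risks of the three decisions are 0.6, 0.25 and 0.4, so the sketch selects $u = 0.5$.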



The strategy $\tilde{\mathbf{u}}^N$ is a type of sequential stochastic programming algorithm to approximate the mean-optimal strategy. In this setting, it is of interest to establish whether the strategy $\tilde{\mathbf{u}}^N$ is in fact approximately pathwise optimal, at least in the weak sense. To this end, we prove the following approximation lemma.

Lemma 3.19. In the hidden Markov model setting with equimeasurable loss $\ell_0(u) = l(u,X_0)$, suppose that the filter is ergodic, and let $\Pi_k^N$ be an approximation of $\Pi_k$. If

$$\lim_{N\to\infty}\limsup_{T\to\infty} E\Big[\frac{1}{T}\sum_{k=1}^{T}\operatorname*{ess\,sup}_{u}\big|\Pi_k^N(l(u,\cdot))-\Pi_k(l(u,\cdot))\big|\Big] = 0,$$

then the strategy $\tilde{\mathbf{u}}^N$ is approximately weakly pathwise optimal in the sense that

$$\lim_{N\to\infty}\liminf_{T\to\infty} P[L_T(\mathbf{u})-L_T(\tilde{\mathbf{u}}^N)\ge-\varepsilon] = 1 \quad\text{for every } \varepsilon>0$$

holds for every admissible strategy $\mathbf{u}$.

Proof. We begin by noting that

$$P[L_T(\mathbf{u})-L_T(\tilde{\mathbf{u}}^N)<-\varepsilon] \le P[L_T(\mathbf{u})-L_T(\tilde{\mathbf{u}})<-\varepsilon/2] + P[L_T(\tilde{\mathbf{u}}^N)-L_T(\tilde{\mathbf{u}})>\varepsilon/2].$$

Under the present assumptions, the mean-optimal strategy $\tilde{\mathbf{u}}$ is (weakly) pathwise optimal. It follows⁷ as in the proof of Lemma 2.12 that $E[(L_T(\tilde{\mathbf{u}}^N)-L_T(\tilde{\mathbf{u}}))_-]\to 0$ as $T\to\infty$, and we obtain for any admissible strategy $\mathbf{u}$ and $\varepsilon>0$

$$\limsup_{T\to\infty} P[L_T(\mathbf{u})-L_T(\tilde{\mathbf{u}}^N)<-\varepsilon] \le \frac{2}{\varepsilon}\limsup_{T\to\infty} E[L_T(\tilde{\mathbf{u}}^N)-L_T(\tilde{\mathbf{u}})].$$

To proceed, we estimate

$$E[L_T(\tilde{\mathbf{u}}^N)-L_T(\tilde{\mathbf{u}})] = E\Big[\frac{1}{T}\sum_{k=1}^{T}\int\{l(\tilde u_k^N,x)-l(\tilde u_k,x)\}\,\Pi_k(dx)\Big]$$
$$\le E\Big[\frac{1}{T}\sum_{k=1}^{T}\int\{l(\tilde u_k^N,x)-l(\tilde u_k,x)\}\,\Pi_k^N(dx)\Big] + 2\,E\Big[\frac{1}{T}\sum_{k=1}^{T}\operatorname*{ess\,sup}_{u}\big|\Pi_k^N(l(u,\cdot))-\Pi_k(l(u,\cdot))\big|\Big].$$

But note that by the definition of $\tilde{\mathbf{u}}^N$

$$\int\{l(\tilde u_k^N,x)-l(\tilde u_k,x)\}\,\Pi_k^N(dx) = \inf_{u\in U}\int l(u,x)\,\Pi_k^N(dx) - \int l(\tilde u_k,x)\,\Pi_k^N(dx) \le 0.$$

The proof is therefore easily completed by applying the assumption. □

7 As particle filters employ a random sampling mechanism, the strategy $\tilde{\mathbf{u}}^N$ is technically speaking not admissible in the sense of this paper: $\Pi_k^N$ (and therefore $\tilde u_k^N$) depends also on auxiliary sampling variables $\xi_0,\dots,\xi_k$ that are independent of $Y_0,\dots,Y_k$. However, it is easily seen that all our results still hold when such randomized strategies are considered. Indeed, it suffices to condition on $(\xi_k)_{k\ge 0}$, so that all our results apply immediately under the conditional distribution.



Evidently, the key difficulty in this problem is to control the time-average error of the filter approximation (in a norm determined by the loss function $l$) uniformly over the time horizon. This problem is intimately related with the ergodic theory of nonlinear filters. The requisite property follows from the results in [13] under reasonable ergodicity assumptions but under very stringent complexity assumptions on the loss (essentially that $\{l(u,\cdot) : u\in U\}$ is uniformly Lipschitz). Alternatively,
one can apply the results in 0, which require exceedingly strong ergodicity as- 
sumptions but weaker complexity assumptions. Let us note that one could similarly 
obtain a pathwise version of Lemma 3.19, but the requisite pathwise approximation property of particle filters has not been investigated in the literature.



3.4 The conditions of Algoet, Weissman, Merhav, and Nobel 

The aim of this section is to briefly discuss the assumptions imposed in previous work on the pathwise optimality property due to Algoet [1], Weissman and Merhav [32], and Nobel [24]. Let us emphasize that, while our results cover a much broader range of decision problems, none of these previous results follow in their entirety from our general results. This highlights once more that our results are, unfortunately, nowhere close to a complete characterization of the pathwise optimality property.



3.4.1 Algoet 

Algoet's results [1], which cover the full information setting only, were already discussed at length in the introduction and in Section 2. The existence of a pathwise optimal strategy can be obtained in this setting under no additional assumptions from Theorem 2.6, which even goes beyond Algoet's result in that it gives an explicit expression for the optimal asymptotic loss. However, Algoet establishes that in fact the mean-optimal strategy $\tilde{\mathbf{u}}$ is pathwise optimal in this setting, while our general Corollary 2.11 can only establish this under an additional complexity assumption. We do not know whether this complexity assumption can be weakened in general.

3.4.2 Weissman and Merhav 

Weissman and Merhav [32] consider the stochastic process setting $(X,Y)$, where $X_k$ takes values in $\{0,1\}$ and $Y_k$ takes values in $\mathbb{R}$ for all $k\in\mathbb{Z}$, and where the loss function takes the form $\ell_0(u) = l(u,X_1)$ and is assumed to be uniformly bounded. As $X$ is binary-valued, it is immediate that any loss function $l$ is equimeasurable. Therefore, our results show that the mean-optimal strategy $\tilde{\mathbf{u}}$ is pathwise optimal whenever the model is a conditional $K$-automorphism relative to $\mathcal{Y}_{-\infty,0}$. The assumption imposed by Weissman and Merhav in [32] is as follows:

$$\sum_{k}\sup_{r} E\big[\big|P[X_{r+k}=a\mid X_r=a,\mathcal{Y}_{0,r+k-1}] - P[X_{r+k}=a\mid\mathcal{Y}_{0,r+k-1}]\big|\big] < \infty \quad\text{for } a=0,1.$$

Using stationarity, this condition is equivalent to

$$\sum_{k}\sup_{r} E\big[\big|P[X_1=a\mid X_{-k}=a,\mathcal{Y}_{-r-k,0}] - P[X_1=a\mid\mathcal{Y}_{-r-k,0}]\big|\big] < \infty \quad\text{for } a=0,1,$$

which readily implies

$$\sum_{k=0}^{\infty} E\big[\big|P[X_1=a\mid\sigma\{X_{-k}\}\vee\mathcal{Y}_{-\infty,0}] - P[X_1=a\mid\mathcal{Y}_{-\infty,0}]\big|\big] < \infty.$$

If the $\sigma$-field $\sigma\{X_{-k}\}\vee\mathcal{Y}_{-\infty,0}$ could be replaced by the larger $\sigma$-field $\mathcal{X}_{-\infty,-k}\vee\mathcal{Y}_{-\infty,0}$ in this expression, then Assumption 3 of Corollary 2.11 would follow immediately. However, the smaller $\sigma$-field appears to yield a slightly better variant of the assumption imposed in [32]. This is possible because the result is restricted to the special choice of loss $\ell_0(u) = l(u,X_1)$ that depends on $X_1$ only. On the other hand, it is to be expected that in most cases the assumption of [32] is much more stringent than that of Corollary 2.11. Note that Assumption 3 of Corollary 2.11 is purely qualitative in nature: it states, roughly speaking, that two $\sigma$-fields coincide. This is a structural property of the model. On the other hand, the assumption of [32] is inherently quantitative in nature: it requires that a certain mixing property holds at a sufficiently fast rate (the mixing coefficients must be summable). A quantitative bound on the mixing rate is both much more restrictive and much harder to verify, in general, as compared to a purely structural property.

In a sense, the approach of Weissman and Merhav is much closer in spirit to the weak pathwise optimality results in this paper than it is to the pathwise optimality results. Indeed, if we replace the weak pathwise optimality property

$$P[L_T(\mathbf{u})-L_T(\mathbf{u}^\star)<-\varepsilon] \xrightarrow{T\to\infty} 0 \quad\text{for every } \varepsilon>0$$

by its quantitative counterpart

$$\sum_{T=1}^{\infty} P[L_T(\mathbf{u})-L_T(\mathbf{u}^\star)<-\varepsilon] < \infty \quad\text{for every } \varepsilon>0,$$

then pathwise optimality will automatically follow from the Borel-Cantelli lemma. In the same spirit, if in Theorem 2.13 we replace the uniform conditional mixing assumption by the corresponding quantitative counterpart

$$\sum_{k=1}^{T} E\Big[\operatorname*{ess\,sup}_{u,u'\in\mathbb{U}_0}\big|E[\{\ell_0^M(u)\circ T^{-k}\}\,\ell_0^M(u')\mid\mathcal{Y}_{-\infty,0}]\big|\Big] = o(T^{\alpha})$$

for some $\alpha<1$ (that may depend on $M$), then we easily obtain a pathwise version of Lemma 4.7 below (using Etemadi's well-known device [11]), and consequently the conclusion of Theorem 2.13 is replaced by that of Theorem 2.6. It is unclear whether such quantitative mixing conditions provide a distinct mechanism for pathwise optimality as compared to qualitative structural conditions as in our main results.



3.4.3 Nobel 

Nobel [24] considers the stochastic process setting $(X,Y)$ with observations of the additive form $Y_k = X_k + N_k$, where $N = (N_k)_{k\in\mathbb{Z}}$ is an $L^2$-martingale difference sequence independent of $X$. The loss function considered is the mean-square loss $\ell_0(u) = (u-X_1)^2$. This very special scenario is essential for the result given in [24]; on the other hand, it is not assumed that $(X,Y)$ is even stationary or that the decision space $U$ is a compact set (when $U=\mathbb{R}$, the quadratic loss is not dominated). In order to compare with our general results, we will additionally assume that $(X,Y)$ is stationary and ergodic and that $X_k$ are uniformly bounded random variables (so that we may choose $U = [-\|X_1\|_\infty,\|X_1\|_\infty]$ without loss of generality).

While this is certainly a decision problem with partial information, the key observation is that this special problem is in fact a decision problem with full information in disguise. Indeed, note that we can write for any strategy $\mathbf{u}$

$$L_T(\mathbf{u}) = \frac{1}{T}\sum_{k=1}^{T}(u_k-Y_{k+1})^2 + \frac{1}{T}\sum_{k=1}^{T}\{X_{k+1}^2-Y_{k+1}^2\} + \frac{1}{T}\sum_{k=1}^{T}2u_kN_{k+1}.$$

The last term of this expression converges to zero a.s. as $T\to\infty$ for any admissible strategy $\mathbf{u}$ by the martingale law of large numbers, as $(u_kN_{k+1})_{k\ge 1}$ is an $L^2$-martingale difference sequence. On the other hand, the second to last term of this expression does not depend on the strategy $\mathbf{u}$ at all. Therefore,

$$\liminf_{T\to\infty}\{L_T(\mathbf{u})-L_T(\tilde{\mathbf{u}})\} = \liminf_{T\to\infty}\Big\{\frac{1}{T}\sum_{k=1}^{T}(u_k-Y_{k+1})^2 - \frac{1}{T}\sum_{k=1}^{T}(\tilde u_k-Y_{k+1})^2\Big\} \quad\text{a.s.},$$

which corresponds to the decision problem with the full information loss $\ell_0(u) = (u-Y_1)^2$. Thus pathwise optimality of the mean-optimal strategy $\tilde{\mathbf{u}}$ follows from Algoet's result. (The main difficulty in [24] is to introduce suitable truncations to deal with the lack of boundedness, which we avoided here.)
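The three-term decomposition of $L_T(\mathbf{u})$ is a purely algebraic identity, which the following Python sketch checks on synthetic numbers. Everything here is an assumption made for illustration: the sample size, the uniform and Gaussian draws, and the function name are hypothetical, and the decisions are arbitrary numbers (admissibility matters only for the law of large numbers, not for the algebra).

```python
import random

def loss_decomposition(us, xs, ns):
    """Return (L_T, full-information term, strategy-free term, martingale term).
    us[k] is the decision u_k; xs[k] and ns[k] stand for X_{k+1} and N_{k+1}."""
    T = len(us)
    ys = [x + n for x, n in zip(xs, ns)]  # Y_{k+1} = X_{k+1} + N_{k+1}
    L = sum((u - x) ** 2 for u, x in zip(us, xs)) / T
    full_info = sum((u - y) ** 2 for u, y in zip(us, ys)) / T
    strategy_free = sum(x * x - y * y for x, y in zip(xs, ys)) / T
    martingale = sum(2 * u * n for u, n in zip(us, ns)) / T
    return L, full_info, strategy_free, martingale

rng = random.Random(1)
T = 1000
xs = [rng.uniform(-1, 1) for _ in range(T)]  # bounded signal values
ns = [rng.gauss(0, 1) for _ in range(T)]     # noise values
us = [rng.uniform(-1, 1) for _ in range(T)]  # arbitrary decisions in U = [-1, 1]
L, a, b, c = loss_decomposition(us, xs, ns)
```

Up to floating-point rounding, `L` equals `a + b + c`, and only the first term `a` depends on the decisions in a way that survives the limit $T\to\infty$.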

Of course, we could deduce the result from our general theory in the same manner: reduce first to a full information decision problem as above, and then invoke Corollary 2.11 in the full information setting. However, a more relevant test of our general theory might be to ask whether one can deduce the result directly from Corollary 2.11 without first reducing to the full information setting. Unfortunately, it is not clear whether it is possible, in general, to find a generating $\sigma$-field $\mathcal{X}$ such that Assumption 3 of Corollary 2.11 holds.

One might interpret the additive noise model as a type of "informative" observations: while $X$ cannot be reconstructed from the observations $Y$, the law of $X$ can certainly be reconstructed from the law of $Y$ even if the former were not known a priori (this idea is exploited in [32, 24] to devise universal prediction strategies that do not require prior knowledge of the law of $X$). In the hidden Markov model setting, there is in fact a connection between "informative" observations and the conditional $K$-property. In particular, if $(X,Y)$ is a hidden Markov model where $X_k$ takes a finite number of values, and $Y_k = X_k + \xi_k$ where $\xi_k$ are i.i.d. and independent of $X$, then the conditional $K$-property holds, and we therefore have pathwise optimal strategies for any dominated loss. This follows from observability conditions in the Markov setting, cf. [5, Section 6.2] and the references therein. However, the ideas that lead to this result do not appear to extend to more general situations.



4 Proofs 

4.1 Proof of Theorem 2.6

Throughout the proof, we fix a generating $\sigma$-field $\mathcal{X}$ that satisfies the conditions of Theorem 2.6. In the following, we define the $\sigma$-fields

Note that $\mathcal{G}_k^n$ is decreasing in $n$ and increasing in $k$.
We begin by establishing the following lemma.

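Lemma 4.1. For any admissible strategy $\mathbf{u}$ and any $m,n\in\mathbb{Z}$

$$\frac{1}{T}\sum_{k=1}^{T}\{E[\ell_k(u_k)\mid\mathcal{G}_k^n] - E[\ell_k(u_k)\mid\mathcal{G}_k^m]\} \xrightarrow{T\to\infty} 0 \quad\text{a.s.}$$

Proof. Assume $m<n$ without loss of generality. Fix $r<\infty$, and define

$$\Delta_k^j = E[\ell_k(u_k)\,1_{\Lambda\circ T^k\le r}\mid\mathcal{G}_k^j] - E[\ell_k(u_k)\,1_{\Lambda\circ T^k\le r}\mid\mathcal{G}_k^{j+1}]$$

for $m\le j<n$. Then it is easily seen that we have the inequality

$$\Big|\frac{1}{T}\sum_{k=1}^{T}\{E[\ell_k(u_k)\mid\mathcal{G}_k^n] - E[\ell_k(u_k)\mid\mathcal{G}_k^m]\}\Big| \le \sum_{j=m}^{n-1}\Big|\frac{1}{T}\sum_{k=1}^{T}\Delta_k^j\Big| + \frac{1}{T}\sum_{k=1}^{T}\{E[\Lambda 1_{\Lambda>r}\mid\mathcal{G}_0^m] + E[\Lambda 1_{\Lambda>r}\mid\mathcal{G}_0^n]\}\circ T^k.$$

By the ergodic theorem, the second term on the right converges to $\chi(r) := E[2\Lambda 1_{\Lambda>r}]$ a.s. as $T\to\infty$. It remains to consider the first term.

To this end, note the inclusions $\mathcal{G}_k^{j+1}\subset\mathcal{G}_k^j\subset\mathcal{G}_{k+1}^{j+1}$. It follows that

$$\Delta_k^j \text{ is } \mathcal{G}_{k+1}^{j+1}\text{-measurable}, \quad E[\Delta_k^j\mid\mathcal{G}_k^{j+1}] = 0, \quad\text{and}\quad |\Delta_k^j|\le 2r$$

for $m\le j<n$. Thus $(\Delta_k^j)_{k\ge 1}$ is a uniformly bounded martingale difference sequence with respect to the filtration $(\mathcal{G}_{k+1}^{j+1})_{k\ge 1}$, and we consequently have

$$\frac{1}{T}\sum_{k=1}^{T}\Delta_k^j \xrightarrow{T\to\infty} 0 \quad\text{a.s.}$$

by the simplest form of the martingale law of large numbers (indeed, it is easily seen that $M_n = \sum_{k=1}^{n}\Delta_k^j/k$ is an $L^2$-bounded martingale, so that the result follows from the martingale convergence theorem and Kronecker's lemma).
Putting together these results, we obtain

$$\limsup_{T\to\infty}\Big|\frac{1}{T}\sum_{k=1}^{T}\{E[\ell_k(u_k)\mid\mathcal{G}_k^n] - E[\ell_k(u_k)\mid\mathcal{G}_k^m]\}\Big| \le \chi(r) \quad\text{a.s.}$$

for arbitrary $r<\infty$. Letting $r\to\infty$ completes the proof. □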
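The martingale law of large numbers invoked in the proof can be observed on a toy example. The Python sketch below is an illustration under assumptions of our own choosing (fair coin flips, a weight depending on the previous flip, a fixed seed, hypothetical names); it is not the martingale of the proof. It tracks both the time average of bounded martingale differences and the auxiliary series $M_n = \sum_{k\le n} d_k/k$ used with Kronecker's lemma.

```python
import random

def martingale_average(T, seed=0):
    """Time average of bounded martingale differences d_k = c_k * e_k, where e_k are
    fair +-1 coin flips and the bounded weight c_k depends only on the previous flip."""
    rng = random.Random(seed)
    total, kronecker_sum, prev = 0.0, 0.0, 1
    for k in range(1, T + 1):
        c = 1.0 if prev == 1 else 0.5    # predictable weight: uses only e_{k-1}
        e = 1 if rng.random() < 0.5 else -1
        d = c * e                        # E[d | past] = 0 and |d| <= 1
        total += d
        kronecker_sum += d / k           # M_n = sum_k d_k / k, an L^2-bounded martingale
        prev = e
    return total / T, kronecker_sum

avg, m = martingale_average(200000)
```

The time average `avg` is close to zero for large `T`, while `m` stays bounded, in agreement with the argument via the martingale convergence theorem and Kronecker's lemma.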

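We can now establish a lower bound on the loss of any strategy.

Corollary 4.2. Under the assumptions of Theorem 2.6, we have

$$\frac{1}{T}\sum_{k=1}^{T}\{\ell_k(u_k) - E[\ell_k(u_k)\mid\mathcal{G}_k^\infty]\} \xrightarrow{T\to\infty} 0 \quad\text{a.s.}$$

for any admissible strategy $\mathbf{u}$. In particular,

$$\liminf_{T\to\infty} L_T(\mathbf{u}) \ge E\Big[\operatorname*{ess\,inf}_{u\in\mathbb{U}_0} E[\ell_0(u)\mid\mathcal{G}_0^\infty]\Big] \quad\text{a.s.}$$

Proof. We begin by noting that

$$\Big|\frac{1}{T}\sum_{k=1}^{T}\{E[\ell_k(u_k)\mid\mathcal{G}_k^n] - E[\ell_k(u_k)\mid\mathcal{G}_k^\infty]\}\Big| \le \frac{1}{T}\sum_{k=1}^{T}\operatorname*{ess\,sup}_{u\in\mathbb{U}_k}\big|E[\ell_k(u)\mid\mathcal{G}_k^n] - E[\ell_k(u)\mid\mathcal{G}_k^\infty]\big|$$
$$\xrightarrow{T\to\infty} E\Big[\operatorname*{ess\,sup}_{u\in\mathbb{U}_0}\big|E[\ell_0(u)\mid\mathcal{G}_0^n] - E[\ell_0(u)\mid\mathcal{G}_0^\infty]\big|\Big] \quad\text{a.s.}$$

by the ergodic theorem. Similarly,

$$\Big|\frac{1}{T}\sum_{k=1}^{T}\{E[\ell_k(u_k)\mid\mathcal{G}_k^m] - \ell_k(u_k)\}\Big| \le \frac{1}{T}\sum_{k=1}^{T}\operatorname*{ess\,sup}_{u\in\mathbb{U}_k}\big|E[\ell_k(u)\mid\mathcal{G}_k^m] - \ell_k(u)\big|$$
$$\xrightarrow{T\to\infty} E\Big[\operatorname*{ess\,sup}_{u\in\mathbb{U}_0}\big|E[\ell_0(u)\mid\mathcal{G}_0^m] - \ell_0(u)\big|\Big] \quad\text{a.s.}$$

Therefore, using Lemma 4.1 and Assumption 2 of Theorem 2.6, the first statement of the Corollary follows by letting $n\to\infty$ and $m\to-\infty$.
For the second statement, it suffices to note that

$$\frac{1}{T}\sum_{k=1}^{T}E[\ell_k(u_k)\mid\mathcal{G}_k^\infty] \ge \frac{1}{T}\sum_{k=1}^{T}\operatorname*{ess\,inf}_{u\in\mathbb{U}_0}E[\ell_0(u)\mid\mathcal{G}_0^\infty]\circ T^k \xrightarrow{T\to\infty} L^\star \quad\text{a.s.}$$

by the ergodic theorem and Assumption 3 of Theorem 2.6. □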



As was explained in the introduction, a pathwise optimal strategy could easily be obtained if one can prove an "ergodic tower property" of the form

$$\frac{1}{T}\sum_{k=1}^{T}\{\ell_k(u_k) - E[\ell_k(u_k)\mid\mathcal{Y}_{0,k}]\} \xrightarrow{T\to\infty} 0 \quad\text{a.s.}$$

Corollary 4.2 establishes just such a property, but where the $\sigma$-field $\mathcal{Y}_{0,k}$ is replaced by the larger $\sigma$-field $\mathcal{G}_k^\infty$. This yields a lower bound on the asymptotic loss, but it is far from clear that one can choose a $\mathcal{Y}_{0,k}$-adapted strategy that attains this bound.

Therefore, what remains is to show that there exists an admissible strategy $\mathbf{u}^\star$ that attains the lower bound in Corollary 4.2. A promising candidate is the mean-optimal strategy $\tilde{\mathbf{u}}$. Unfortunately, we are not able to prove pathwise optimality of the mean-optimal strategy in the general setting of Theorem 2.6. However, we will obtain a pathwise optimal strategy $\mathbf{u}^\star$ by a judicious modification of the mean-optimal strategy $\tilde{\mathbf{u}}$. The key idea is the following "uniform" version of the martingale convergence theorem, which we prove following Neveu [23, Lemma V-2-9].



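Lemma 4.3. The following holds:

$$\operatorname*{ess\,inf}_{u\in\mathbb{U}_{-k,0}} E[\ell_0(u)\mid\mathcal{Y}_{-k,0}] \xrightarrow{k\to\infty} \operatorname*{ess\,inf}_{u\in\mathbb{U}_0} E[\ell_0(u)\mid\mathcal{Y}_{-\infty,0}] \quad\text{a.s. and in } L^1.$$

Proof. Using the construction of the essential supremum, we can choose for each $0\le k<\infty$ a countable family $\mathbb{U}^c_{-k,0}\subset\mathbb{U}_{-k,0}$ such that

$$\operatorname*{ess\,inf}_{u\in\mathbb{U}_{-k,0}} E[\ell_0(u)\mid\mathcal{Y}_{-k,0}] = \inf_{u\in\mathbb{U}^c_{-k,0}} E[\ell_0(u)\mid\mathcal{Y}_{-k,0}] \quad\text{a.s.},$$

and a countable family $\mathbb{U}^c_0\subset\mathbb{U}_0$ such that

$$\operatorname*{ess\,inf}_{u\in\mathbb{U}_0} E[\ell_0(u)\mid\mathcal{Y}_{-\infty,0}] = \inf_{u\in\mathbb{U}^c_0} E[\ell_0(u)\mid\mathcal{Y}_{-\infty,0}] \quad\text{a.s.}$$

For every $0\le k<\infty$, choose an arbitrary ordering $(U_k^n)_{n\in\mathbb{N}}$ of the elements of the countable set $\mathbb{U}^c_{-k,0}\cup(\mathbb{U}^c_0\cap\mathbb{U}_{-k,0})$. Then we clearly have

$$M_k := \operatorname*{ess\,inf}_{u\in\mathbb{U}_{-k,0}} E[\ell_0(u)\mid\mathcal{Y}_{-k,0}] = \min_{0\le l\le k}\,\inf_{n\in\mathbb{N}} E[\ell_0(U_l^n)\mid\mathcal{Y}_{-k,0}] \quad\text{a.s.}$$

and

$$M := \operatorname*{ess\,inf}_{u\in\mathbb{U}_0} E[\ell_0(u)\mid\mathcal{Y}_{-\infty,0}] = \inf_{0\le l<\infty}\,\inf_{n\in\mathbb{N}} E[\ell_0(U_l^n)\mid\mathcal{Y}_{-\infty,0}] \quad\text{a.s.}$$

Our aim is to prove that $M_k\to M$ a.s. and in $L^1$ as $k\to\infty$.

We begin by noting that $|M_k|\le E[\Lambda\mid\mathcal{Y}_{-k,0}]$. Therefore, the sequence $(M_k)_{k\ge 0}$ is uniformly integrable. Moreover, $(M_k)_{k\ge 0}$ is a supermartingale with respect to the filtration $(\mathcal{Y}_{-k,0})_{k\ge 0}$: indeed, we can easily compute

$$E[M_{k+1}\mid\mathcal{Y}_{-k,0}] \le E\Big[\min_{0\le l\le k}\,\inf_{n\in\mathbb{N}} E[\ell_0(U_l^n)\mid\mathcal{Y}_{-k-1,0}]\,\Big|\,\mathcal{Y}_{-k,0}\Big] \le M_k.$$

Thus $M_k\to M_\infty$ a.s. and in $L^1$ by the martingale convergence theorem for some random variable $M_\infty$. We must now show that $M_\infty = M$ a.s. Note that

$$M_\infty = \lim_{k\to\infty} M_k \le \lim_{k\to\infty} E[\ell_0(U_l^n)\mid\mathcal{Y}_{-k,0}] = E[\ell_0(U_l^n)\mid\mathcal{Y}_{-\infty,0}] \quad\text{a.s.}$$

for every $n\in\mathbb{N}$ and $0\le l<\infty$, so $M_\infty\le M$ a.s. To complete the proof, it therefore suffices to show that $E[M_\infty] = E[M]$.

To this end, define for $N\in\mathbb{N}$ and $0\le k<\infty$

$$M_k^N = \min_{l\le N\wedge k}\,\min_{n\le N} E[\ell_0(U_l^n)\mid\mathcal{Y}_{-k,0}].$$

As $(M_k^N)_{k\ge 0}$ is again a supermartingale, clearly $E[M_k^N]$ is doubly nonincreasing in $k$ and $N$. The exchange of limits is therefore permitted, so that

$$E[M_\infty] = \lim_{k\to\infty}\lim_{N\to\infty} E[M_k^N] = \lim_{N\to\infty}\lim_{k\to\infty} E[M_k^N] = E[M].$$

This completes the proof. □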
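Corollary 4.4. Suppose that Assumption 3 of Theorem 2.6 holds. Then

$$E[\ell_k(\tilde u_k)\mid\mathcal{G}_k^\infty]\circ T^{-k} \xrightarrow{k\to\infty} \operatorname*{ess\,inf}_{u\in\mathbb{U}_0} E[\ell_0(u)\mid\mathcal{G}_0^\infty] \quad\text{in } L^1.$$

Proof. Define $\bar u_k = \tilde u_k\circ T^{-k}\in\mathbb{U}_{-k,0}$, so that

$$E[\ell_k(\tilde u_k)\mid\mathcal{G}_k^\infty]\circ T^{-k} = E[\ell_0(\bar u_k)\mid\mathcal{G}_0^\infty].$$

By stationarity and the definition of $\tilde{\mathbf{u}}$, we have

$$E[E[\ell_0(\bar u_k)\mid\mathcal{G}_0^\infty]] = E[E[\ell_0(\bar u_k)\mid\mathcal{Y}_{-k,0}]] \le E\Big[\operatorname*{ess\,inf}_{u\in\mathbb{U}_{-k,0}} E[\ell_0(u)\mid\mathcal{Y}_{-k,0}]\Big] + k^{-1}.$$

Therefore, by Lemma 4.3 we have

$$\limsup_{k\to\infty} E[E[\ell_0(\bar u_k)\mid\mathcal{G}_0^\infty]] \le E\Big[\operatorname*{ess\,inf}_{u\in\mathbb{U}_0} E[\ell_0(u)\mid\mathcal{Y}_{-\infty,0}]\Big].$$

On the other hand, note that

$$E[\ell_0(\bar u_k)\mid\mathcal{G}_0^\infty] \ge \operatorname*{ess\,inf}_{u\in\mathbb{U}_0} E[\ell_0(u)\mid\mathcal{G}_0^\infty] \quad\text{a.s.}$$

Using Assumption 3, we therefore have

$$\limsup_{k\to\infty} E\Big[\Big|E[\ell_0(\bar u_k)\mid\mathcal{G}_0^\infty] - \operatorname*{ess\,inf}_{u\in\mathbb{U}_0} E[\ell_0(u)\mid\mathcal{G}_0^\infty]\Big|\Big] \le 0.$$

This completes the proof. □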



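We are now in the position to construct the pathwise optimal strategy $\mathbf{u}^\star$. By Corollary 4.4, we can choose a (nonrandom) sequence $k_n\uparrow\infty$ such that

$$E[\ell_{k_n}(\tilde u_{k_n})\mid\mathcal{G}_{k_n}^\infty]\circ T^{-k_n} \xrightarrow{n\to\infty} \operatorname*{ess\,inf}_{u\in\mathbb{U}_0} E[\ell_0(u)\mid\mathcal{G}_0^\infty] \quad\text{a.s.}$$

Let us define

$$u_k^\star = \tilde u_{k_n}\circ T^{k-k_n} \quad\text{for } k_n\le k<k_{n+1},\ n\in\mathbb{N}.$$

Then clearly $\mathbf{u}^\star = (u_k^\star)_{k\ge 1}$ is an admissible strategy.

Lemma 4.5. Suppose that the assumptions of Theorem 2.6 hold. Then

$$\lim_{T\to\infty} L_T(\mathbf{u}^\star) = L^\star \quad\text{a.s.}$$

Proof. By construction,

$$E[\ell_k(u_k^\star)\mid\mathcal{G}_k^\infty]\circ T^{-k} \xrightarrow{k\to\infty} \operatorname*{ess\,inf}_{u\in\mathbb{U}_0} E[\ell_0(u)\mid\mathcal{G}_0^\infty] \quad\text{a.s.}$$

Moreover,

$$\sup_{k\ge 1}\big|E[\ell_k(u_k^\star)\mid\mathcal{G}_k^\infty]\circ T^{-k}\big| \le E[\Lambda\mid\mathcal{G}_0^\infty]\in L^1.$$

Therefore, by Maker's generalized ergodic theorem [19, Corollary 10.8]

$$\frac{1}{T}\sum_{k=1}^{T} E[\ell_k(u_k^\star)\mid\mathcal{G}_k^\infty] \xrightarrow{T\to\infty} E\Big[\operatorname*{ess\,inf}_{u\in\mathbb{U}_0} E[\ell_0(u)\mid\mathcal{G}_0^\infty]\Big] = L^\star \quad\text{a.s.}$$

Thus $L_T(\mathbf{u}^\star)\to L^\star$ a.s. as $T\to\infty$ by Corollary 4.2. □

The proof of Theorem 2.6 is now complete. Indeed, if $\mathbf{u}$ is admissible, then

$$\liminf_{T\to\infty}\{L_T(\mathbf{u})-L_T(\mathbf{u}^\star)\} = \liminf_{T\to\infty} L_T(\mathbf{u}) - L^\star \ge 0 \quad\text{a.s.}$$

by Lemma 4.5 and Corollary 4.2, so $\mathbf{u}^\star$ is pathwise optimal.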



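4.2 Proof of Corollary 2.11

To prove pathwise optimality, it suffices to show that $L_T(\tilde{\mathbf{u}})\to L^\star$ a.s.

Lemma 4.6. Under the assumptions of Corollary 2.11, the mean-optimal strategy $\tilde{\mathbf{u}}$ (Lemma 2.5) satisfies $L_T(\tilde{\mathbf{u}})\to L^\star$ a.s. as $T\to\infty$.

Proof. By the definition of $\tilde{\mathbf{u}}$ and Lemma 4.3,

$$E[\ell_k(\tilde u_k)\mid\mathcal{Y}_{0,k}]\circ T^{-k} \xrightarrow{k\to\infty} \operatorname*{ess\,inf}_{u\in\mathbb{U}_0} E[\ell_0(u)\mid\mathcal{Y}_{-\infty,0}] \quad\text{a.s.}$$

Therefore, the third part of Assumption 2 of Corollary 2.11 implies that

$$E[\ell_k(\tilde u_k)\mid\mathcal{Y}_{-\infty,k}]\circ T^{-k} \xrightarrow{k\to\infty} \operatorname*{ess\,inf}_{u\in\mathbb{U}_0} E[\ell_0(u)\mid\mathcal{Y}_{-\infty,0}] \quad\text{a.s.}$$

But by Assumption 3 of Corollary 2.11 and stationarity, we obtain

$$E[\ell_k(\tilde u_k)\mid\mathcal{G}_k^\infty]\circ T^{-k} \xrightarrow{k\to\infty} \operatorname*{ess\,inf}_{u\in\mathbb{U}_0} E[\ell_0(u)\mid\mathcal{Y}_{-\infty,0}] \quad\text{a.s.}$$

Moreover, we have

$$\sup_{k\ge 1}\big|E[\ell_k(\tilde u_k)\mid\mathcal{G}_k^\infty]\circ T^{-k}\big| \le E[\Lambda\mid\mathcal{G}_0^\infty]\in L^1.$$

Maker's generalized ergodic theorem [19, Corollary 10.8] therefore yields

$$\frac{1}{T}\sum_{k=1}^{T} E[\ell_k(\tilde u_k)\mid\mathcal{G}_k^\infty] \xrightarrow{T\to\infty} E\Big[\operatorname*{ess\,inf}_{u\in\mathbb{U}_0} E[\ell_0(u)\mid\mathcal{Y}_{-\infty,0}]\Big] = L^\star \quad\text{a.s.}$$

As the assumptions of Corollary 2.11 imply those of Theorem 2.6, the result as well as pathwise optimality of $\tilde{\mathbf{u}}$ now follow from Corollary 4.2. □

4.3 Proof of Theorem 2.13

The proof of the Theorem is once again based on a variant of the "ergodic tower property" described in the introduction. In the present setting, the result follows rather easily from the conditional weak mixing assumption.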

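Lemma 4.7. Suppose that the assumption of Theorem 2.13 holds. Then

$$\frac{1}{T}\sum_{k=1}^{T}\{\ell_k(u_k) - E[\ell_k(u_k)\mid\mathcal{Y}_{-\infty,k}]\} \xrightarrow{T\to\infty} 0 \quad\text{in } L^1$$

for every admissible strategy $\mathbf{u}$.

Proof. Define $\ell_k^M(u) = \ell_0^M(u)\circ T^k$ for $u\in U$. We begin by noting that

$$E\Big[\Big(\frac{1}{T}\sum_{k=1}^{T}\ell_k^M(u_k)\Big)^2\Big] \le \frac{2}{T^2}\sum_{n=1}^{T}\sum_{m=1}^{n} E[\ell_n^M(u_n)\,\ell_m^M(u_m)].$$

Suppose that $m<n$. Then by stationarity and as $\mathbf{u}$ is admissible

$$E[\ell_n^M(u_n)\,\ell_m^M(u_m)] = E\big[\ell_0^M(u_n\circ T^{-n})\,\{\ell_0^M(u_m\circ T^{-m})\circ T^{-(n-m)}\}\big]$$
$$\le E\Big[\operatorname*{ess\,sup}_{u,u'\in\mathbb{U}_0}\big|E[\ell_0^M(u')\,\{\ell_0^M(u)\circ T^{-(n-m)}\}\mid\mathcal{Y}_{-\infty,0}]\big|\Big].$$

We can therefore estimate

$$E\Big[\Big(\frac{1}{T}\sum_{k=1}^{T}\ell_k^M(u_k)\Big)^2\Big] \le \frac{2}{T^2}\sum_{n=1}^{T}\sum_{k=0}^{n-1} E\Big[\operatorname*{ess\,sup}_{u,u'\in\mathbb{U}_0}\big|E[\ell_0^M(u')\,\{\ell_0^M(u)\circ T^{-k}\}\mid\mathcal{Y}_{-\infty,0}]\big|\Big]$$
$$\le \frac{2}{T}\sum_{k=0}^{T-1} E\Big[\operatorname*{ess\,sup}_{u,u'\in\mathbb{U}_0}\big|E[\ell_0^M(u')\,\{\ell_0^M(u)\circ T^{-k}\}\mid\mathcal{Y}_{-\infty,0}]\big|\Big].$$

By the uniform conditional mixing assumption, it follows that

$$\lim_{M\to\infty}\limsup_{T\to\infty} E\Big[\Big|\frac{1}{T}\sum_{k=1}^{T}\ell_k^M(u_k)\Big|\Big] = 0.$$

On the other hand, note that

$$\sup_{T\ge 1} E\Big[\Big|\frac{1}{T}\sum_{k=1}^{T}\{\ell_k(u_k) - E[\ell_k(u_k)\mid\mathcal{Y}_{-\infty,k}]\} - \frac{1}{T}\sum_{k=1}^{T}\ell_k^M(u_k)\Big|\Big] \le E[2\Lambda 1_{\Lambda>M}].$$

The result now follows by applying the triangle inequality.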



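□

Corollary 4.8. Under the assumption of Theorem 2.13, we have

$$P[L_T(\mathbf{u}) - L^\star < -\varepsilon] \xrightarrow{T\to\infty} 0 \quad\text{for every } \varepsilon>0$$

for every admissible strategy $\mathbf{u}$.

Proof. Let $\mathbf{u}$ be any admissible strategy. Then by Lemma 4.7

$$L_T(\mathbf{u}) - \frac{1}{T}\sum_{k=1}^{T} E[\ell_k(u_k)\mid\mathcal{Y}_{-\infty,k}] \xrightarrow{T\to\infty} 0 \quad\text{in } L^1.$$

On the other hand, note that

$$\frac{1}{T}\sum_{k=1}^{T} E[\ell_k(u_k)\mid\mathcal{Y}_{-\infty,k}] \ge \frac{1}{T}\sum_{k=1}^{T}\operatorname*{ess\,inf}_{u\in\mathbb{U}_k} E[\ell_k(u)\mid\mathcal{Y}_{-\infty,k}] \xrightarrow{T\to\infty} L^\star \quad\text{in } L^1$$

by the ergodic theorem. The result follows directly. □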

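In view of Corollary 4.8, in order to establish weak pathwise optimality of $\tilde{\mathbf{u}}$ it evidently suffices to prove that $\tilde{\mathbf{u}}$ satisfies the ergodic theorem $L_T(\tilde{\mathbf{u}})\to L^\star$ in $L^1$. However, most of the work was already done in the proof of Theorem 2.6.

Lemma 4.9. Under the assumption of Theorem 2.13, $L_T(\tilde{\mathbf{u}})\to L^\star$ in $L^1$.

Proof. By the definition of $\tilde{\mathbf{u}}$, we have

$$E[\ell_k(\tilde u_k)\mid\mathcal{Y}_{0,k}]\circ T^{-k} \le \operatorname*{ess\,inf}_{u\in\mathbb{U}_{-k,0}} E[\ell_0(u)\mid\mathcal{Y}_{-k,0}] + k^{-1} \quad\text{a.s.}$$

Therefore, by Lemma 4.3 we obtain

$$\limsup_{k\to\infty} E\big[E[\ell_k(\tilde u_k)\mid\mathcal{Y}_{-\infty,k}]\circ T^{-k}\big] \le E\Big[\operatorname*{ess\,inf}_{u\in\mathbb{U}_0} E[\ell_0(u)\mid\mathcal{Y}_{-\infty,0}]\Big].$$

On the other hand,

$$E[\ell_k(\tilde u_k)\mid\mathcal{Y}_{-\infty,k}]\circ T^{-k} \ge \operatorname*{ess\,inf}_{u\in\mathbb{U}_0} E[\ell_0(u)\mid\mathcal{Y}_{-\infty,0}] \quad\text{a.s.}$$

for all $k\in\mathbb{N}$. It follows that

$$\limsup_{k\to\infty} E\Big[\Big|E[\ell_k(\tilde u_k)\mid\mathcal{Y}_{-\infty,k}]\circ T^{-k} - \operatorname*{ess\,inf}_{u\in\mathbb{U}_0} E[\ell_0(u)\mid\mathcal{Y}_{-\infty,0}]\Big|\Big]$$
$$= \limsup_{k\to\infty} E\big[E[\ell_k(\tilde u_k)\mid\mathcal{Y}_{-\infty,k}]\circ T^{-k}\big] - E\Big[\operatorname*{ess\,inf}_{u\in\mathbb{U}_0} E[\ell_0(u)\mid\mathcal{Y}_{-\infty,0}]\Big] \le 0.$$

Therefore, by Maker's generalized ergodic theorem [19, Corollary 10.8]

$$\frac{1}{T}\sum_{k=1}^{T} E[\ell_k(\tilde u_k)\mid\mathcal{Y}_{-\infty,k}] \xrightarrow{T\to\infty} L^\star \quad\text{in } L^1.$$

The result now follows using Lemma 4.7. □

Combining Corollary 4.8 and Lemma 4.9 completes the proof of Theorem 2.13.

4.4 Proof of Theorem 2.16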

The implication 1 ⇒ 2 of Theorem 2.16 follows immediately from Theorem 1.8. In the following, we will prove the converse implication 2 ⇒ 1: that is, we will show that if $(\Omega,\mathcal{B},P,T)$ is not conditionally weak mixing relative to $\mathcal{Y}$, then we can construct a bounded loss function $\ell$ with some finite decision space $U$ for which there exists no weakly pathwise optimal strategy.

We begin by providing a "diagonal" characterization of conditional weak mixing. 

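Lemma 4.10. $(\Omega,\mathcal{B},P,T)$ is conditionally weak mixing relative to $\mathcal{Z}$ if and only if

$$\frac{1}{T}\sum_{k=1}^{T}\big|E[\{h\circ T^{-k}\}h\mid\mathcal{Z}] - E[h\circ T^{-k}\mid\mathcal{Z}]\,E[h\mid\mathcal{Z}]\big| \xrightarrow{T\to\infty} 0 \quad\text{in } L^1$$

for every $h\in L^2$, provided that $\mathcal{Z}\subset T^{-1}\mathcal{Z}$.

Proof. It suffices to show that if the equation display in the lemma holds, then $(\Omega,\mathcal{B},P,T)$ is conditionally weak mixing relative to $\mathcal{Z}$. To this end, let us fix $h\in L^2$ and denote by $\mathcal{A}$ the class of all functions $g\in L^2$ such that

$$\frac{1}{T}\sum_{k=1}^{T}\big|E[\{g\circ T^{-k}\}h\mid\mathcal{Z}] - E[g\circ T^{-k}\mid\mathcal{Z}]\,E[h\mid\mathcal{Z}]\big| \xrightarrow{T\to\infty} 0 \quad\text{in } L^1.$$

Clearly $\mathcal{A}$ is a closed linear subspace of $L^2$. Note that $\mathcal{A}$ certainly contains every random variable of the form $(h1_B)\circ T^m$ or $1_B\circ T^m$ for $m\in\mathbb{Z}$ and $B\in\mathcal{Z}$. Therefore, the closed linear span $K$ of all such random variables is included in $\mathcal{A}$. On the other hand, suppose that $g\in K^\perp$. Then for every $k\in\mathbb{Z}$, we have

$$E[E[\{g\circ T^{-k}\}h\mid\mathcal{Z}]\,1_B] = E[g\,\{(h1_B)\circ T^{k}\}] = 0$$

for all $B\in\mathcal{Z}$. It follows that $E[\{g\circ T^{-k}\}h\mid\mathcal{Z}] = 0$ a.s. for all $k\in\mathbb{N}$. Similarly, we find that $E[g\circ T^{-k}\mid\mathcal{Z}] = 0$ a.s. for all $k\in\mathbb{N}$. Thus evidently $K^\perp\subset\mathcal{A}$ also. Therefore, $\mathcal{A}$ contains $K\oplus K^\perp = L^2$, and the proof is complete. □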

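In the remainder of this section, we suppose that $(\Omega,\mathcal{B},P,T)$ is not conditionally
weakly mixing relative to $\mathcal{Y}$. By Lemma 4.10 there is a function $h\in L^2$ such that

$$\limsup_{T\to\infty} E\Big[\frac{1}{T}\sum_{k=1}^{T}\big|E[\{H\circ T^{-k}\}\,H\mid\mathcal{Y}]\big|\Big] > \varepsilon > 0$$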
where $H := h - E[h\mid\mathcal{Y}]$. By approximation in $L^2$, we may clearly assume without
loss of generality that $h$ takes values in $[0,1]$, so that $H$ takes values in $[-1,1]$. We
will fix such a function in the sequel, and consider the loss function

$$\ell(u,\omega) = u\,H(\omega),$$

where we initially choose decisions $u\in[-1,1]$ (the decision space will be
discretized at the end of the proof as required by Theorem 2.16). We claim that for
the loss function $\ell$ there exists no weakly pathwise optimal strategy. This will be
proved by a randomization procedure that will be explained presently.
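
The centred variable $H$ is natural here because, when $\mathcal{Y}$ is invariant (so that $E[h\mid\mathcal{Y}]\circ T^{-k} = E[h\circ T^{-k}\mid\mathcal{Y}]$), the summand of Lemma 4.10 is exactly a conditional covariance of $H$. The following short derivation is a sketch under that invariance assumption; the abbreviation $a_k := E[h\circ T^{-k}\mid\mathcal{Y}]$ is introduced here only for illustration and is not notation from the text.

```latex
\begin{align*}
H\circ T^{-k}
  &= h\circ T^{-k} - E[h\mid\mathcal{Y}]\circ T^{-k}
   = h\circ T^{-k} - a_k
   && \text{(invariance of $\mathcal{Y}$)},\\
E[\{H\circ T^{-k}\}\,H\mid\mathcal{Y}]
  &= E[\{h\circ T^{-k} - a_k\}\{h - a_0\}\mid\mathcal{Y}]\\
  &= E[\{h\circ T^{-k}\}\,h\mid\mathcal{Y}] - a_0 a_k - a_k a_0 + a_k a_0
   && \text{($a_k$, $a_0$ are $\mathcal{Y}$-measurable)}\\
  &= E[\{h\circ T^{-k}\}\,h\mid\mathcal{Y}]
     - E[h\circ T^{-k}\mid\mathcal{Y}]\,E[h\mid\mathcal{Y}].
\end{align*}
```

In particular $E[H\mid\mathcal{Y}] = 0$, which is what makes every constant decision mean-optimal for the loss $\ell(u,\omega) = uH(\omega)$.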
In the following $([0,1],\mathcal{I})$ denotes the unit interval with its Borel σ-field.

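Lemma 4.11. Suppose that $(\Omega,\mathcal{B},P)$ is a standard probability space. Then there
exists a $(\mathcal{Y}\otimes\mathcal{I})$-measurable map $\iota:\Omega\times[0,1]\to\Omega$ such that

$$\int_0^1 X(\iota(\omega,\lambda))\,d\lambda = E[X\mid\mathcal{Y}](\omega) \quad\text{for } P\text{-a.e. } \omega\in\Omega$$

for any bounded ($\mathcal{B}$-)measurable function $X:\Omega\to\mathbb{R}$.

Proof. As $(\Omega,\mathcal{B},P)$ is a standard probability space, this is [19, Lemma 3.22]
together with the existence of regular conditional probabilities [19, Theorem 6.3]. □

Consider the quantity

$$A_T^\lambda(\omega) := \frac{1}{T}\sum_{k=1}^{T} H(T^k\iota(\omega,\lambda))\,H(T^k\omega),$$

which, for the loss $\ell(u,\omega) = uH(\omega)$, is the time-average loss of the strategy $u_k = H(T^k\iota(\cdot,\lambda))$. Then we can compute

$$\int_0^1 E[(A_T^\lambda)^2]\,d\lambda = \frac{1}{T^2}\sum_{m,n=1}^{T} E\Big[\big|E[\{H\circ T^m\}\{H\circ T^n\}\mid\mathcal{Y}]\big|^2\Big].$$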



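In particular, using the invariance of $\mathcal{Y}$, we have

$$\begin{aligned}
\int_0^1 E[(A_T^\lambda)^2]\,d\lambda
&\ge \Big(\frac{1}{T^2}\sum_{m,n=1}^{T} E\big[\big|E[\{H\circ T^m\}\{H\circ T^n\}\mid\mathcal{Y}]\big|\big]\Big)^2 \\
&\ge \Big(\frac{1}{T^2}\sum_{n=1}^{T}\sum_{m=1}^{n} E\big[\big|E[\{H\circ T^{m-n}\}\,H\mid\mathcal{Y}]\big|\big]\Big)^2 \\
&= \Big(E\Big[\frac{1}{T^2}\sum_{k=0}^{T-1}(T-k)\,\big|E[\{H\circ T^{-k}\}\,H\mid\mathcal{Y}]\big|\Big]\Big)^2 \\
&\ge \Big(E\Big[\frac{1}{2T}\sum_{k=0}^{\lfloor T/2\rfloor}\big|E[\{H\circ T^{-k}\}\,H\mid\mathcal{Y}]\big|\Big]\Big)^2 .
\end{aligned}$$

Here the first step is Jensen's inequality, the second uses $E[\{H\circ T^m\}\{H\circ T^n\}\mid\mathcal{Y}] = E[\{H\circ T^{m-n}\}\,H\mid\mathcal{Y}]\circ T^n$ and discards the terms with $m>n$, and the last uses $T-k\ge T/2$ for $k\le\lfloor T/2\rfloor$.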
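
The passage from the double sum over $(m,n)$ to a single sum over lags rests on two elementary facts: the lag $k = n-m \ge 0$ occurs exactly $T-k$ times among pairs in $\{1,\dots,T\}^2$, and $(T-k)/T^2 \ge 1/(2T)$ for $k\le\lfloor T/2\rfloor$. The following is a minimal numerical sanity check of these two facts; the function names are hypothetical and `b` stands for any nonnegative sequence playing the role of the lag-$k$ conditional correlations.

```python
from itertools import product


def lag_counts(T):
    """Count the pairs (m, n) in {1..T}^2 with n - m = k, for k = 0..T-1."""
    counts = [0] * T
    for m, n in product(range(1, T + 1), repeat=2):
        if n >= m:
            counts[n - m] += 1
    return counts


def weighted_lag_average(b, T):
    """Compute (1/T^2) * sum_{k=0}^{T-1} (T - k) * b[k]."""
    return sum((T - k) * b[k] for k in range(T)) / T**2


def half_window_average(b, T):
    """Compute (1/(2T)) * sum_{k=0}^{floor(T/2)} b[k]."""
    return sum(b[k] for k in range(T // 2 + 1)) / (2 * T)
```

For any nonnegative `b`, `weighted_lag_average(b, T)` dominates `half_window_average(b, T)`, which is the last inequality in the chain.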



By our choice of $H$, it follows that

$$\limsup_{T\to\infty} E[(A_T^\lambda)^2] \ge \frac{\varepsilon^2}{16}$$

for some $\lambda = \lambda_0\in[0,1]$. Define

$$u_k(\omega) = H(T^k\iota(\omega,\lambda_0)).$$

Then $u_k$ is $\mathcal{Y}$-measurable for all $k$ (and is therefore admissible if we choose, for the
time being, the continuous decision space $U = [-1,1]$), and $L_T(u) = A_T^{\lambda_0}$. Moreover,

$$\frac{\varepsilon^2}{16} \le \limsup_{T\to\infty} E[(L_T(u))^2] \le \frac{\varepsilon^2}{64} + \limsup_{T\to\infty} P\Big[|L_T(u)| > \frac{\varepsilon}{8}\Big]$$

implies that we may assume without loss of generality that

$$\limsup_{T\to\infty} P\Big[L_T(u) < -\frac{\varepsilon}{8}\Big] > 0$$

(if this is not the case, simply substitute $-u$ for $u$ in the following). But note that
the strategy $\tilde u$ defined by $\tilde u_k = 0$ for all $k$ is mean-optimal (indeed, $E[\ell_k(u)\mid\mathcal{Y}] =
u\,E[H\mid\mathcal{Y}]\circ T^k = 0$ for all $u$ by construction). Thus evidently

$$\limsup_{T\to\infty} P\Big[L_T(u) - L_T(\tilde u) < -\frac{\varepsilon}{8}\Big] > 0,$$

so $\tilde u$ is not weakly pathwise optimal. It follows from Lemma 2.12 that no weakly
pathwise optimal strategy can exist if we choose the decision space $U = [-1,1]$.
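
To make the construction concrete, here is a small numerical sketch for one specific system that is not taken from the text: the irrational rotation $T x = x + \alpha \bmod 1$ on $[0,1)$ with trivial $\mathcal{Y}$, which is ergodic but not weakly mixing. All names and parameter values (`ALPHA`, `H`, `time_average_loss`) are assumptions made for illustration. With trivial $\mathcal{Y}$ one may take $\iota(\omega,\lambda) = \lambda$, and with $H(x) = \cos(2\pi x)$ the strategy $u_k = H(T^k\lambda_0)$ is deterministic; its time-average loss tends to $\tfrac12\cos(2\pi(\omega-\lambda_0))$, which is negative on a set of positive probability, whereas the zero strategy has time-average loss $0$.

```python
import math

# Hypothetical rotation angle; any irrational number would serve.
ALPHA = math.sqrt(2) - 1


def H(x):
    """Mean-zero observable with values in [-1, 1] (an illustrative choice)."""
    return math.cos(2 * math.pi * x)


def time_average_loss(omega, lam, T):
    """L_T(u)(omega) for the strategy u_k = H(T^k lam) and loss l(u, w) = u * H(w)."""
    total = 0.0
    for k in range(1, T + 1):
        u_k = H(lam + k * ALPHA)              # decision, independent of omega
        total += u_k * H(omega + k * ALPHA)   # loss incurred at time k
    return total / T
```

For instance, `time_average_loss(0.75, 0.25, 20000)` is close to $-1/2$, so this deterministic strategy beats the mean-optimal zero strategy along that path, while for `omega = 0.25` it loses by about the same margin.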

To complete the proof of Theorem 2.16 it remains to show that this conclusion
remains valid if we replace $U = [-1,1]$ by some finite set. This is easily attained by
discretization, however. Indeed, let $U = \{k\varepsilon/16 : k = -\lfloor 16/\varepsilon\rfloor,\ldots,\lfloor 16/\varepsilon\rfloor\}$, and
construct a new strategy $u'$ such that $u'_k$ equals the value of $u_k$ (which takes values
in $[-1,1]$) rounded to the nearest element of $U$. Clearly the zero strategy $\tilde u$ and $u'$ both take values in
the finite set $U$, and we have $|L_T(u) - L_T(u')| \le \varepsilon/16$. Therefore,

$$\limsup_{T\to\infty} P\Big[L_T(u') - L_T(\tilde u) < -\frac{\varepsilon}{16}\Big] > 0,$$

and it follows again by Lemma 2.12 that no weakly pathwise optimal strategy exists.
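
The discretization step can be checked numerically. The sketch below builds the grid $\{k\varepsilon/16\}$, rounds decisions in $[-1,1]$ to the nearest grid point, and evaluates the time-average loss for $\ell(u,\omega) = uH(\omega)$ with $|H|\le 1$; the helper names are hypothetical and the sequences passed to `average_loss` are arbitrary stand-ins for $u_k$ and $H\circ T^k$.

```python
import math


def decision_grid(eps):
    """Finite decision set {k*eps/16 : k = -floor(16/eps), ..., floor(16/eps)}."""
    n = math.floor(16 / eps)
    return [k * eps / 16 for k in range(-n, n + 1)]


def round_to_grid(u, eps):
    """Nearest grid element to a decision u in [-1, 1]."""
    n = math.floor(16 / eps)
    k = round(u * 16 / eps)
    k = max(-n, min(n, k))  # clamp: the grid may stop just short of +-1
    return k * eps / 16


def average_loss(us, hs):
    """Time-average loss (1/T) * sum_k u_k * h_k for l(u, w) = u * H(w)."""
    return sum(u * h for u, h in zip(us, hs)) / len(us)
```

Since each decision moves by less than $\varepsilon/16$ and $|H|\le 1$, the two time averages differ by at most $\varepsilon/16$, and the zero decision belongs to the grid.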




4.5 Proof of Theorem 3.15






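By stationarity, we can rewrite the conditional absolute regularity property as

$$\big\| P[(X_k)_{k\ge 0}\in\cdot\mid\mathcal{X}_{-\infty,-n}\vee\mathcal{Y}_{-\infty,\infty}] - P[(X_k)_{k\ge 0}\in\cdot\mid\mathcal{Y}_{-\infty,\infty}] \big\|_{\mathrm{TV}} \xrightarrow{n\to\infty} 0 \quad\text{in } L^1.$$

Using a simple truncation argument (as the loss is dominated in $L^1$), this implies

$$\operatorname*{ess\,sup}_{u\in\mathbb{U}_0}\big| E[\ell(u,X_0)\mid\mathcal{X}_{-\infty,-n}\vee\mathcal{Y}_{-\infty,\infty}] - E[\ell(u,X_0)\mid\mathcal{Y}_{-\infty,\infty}] \big| \xrightarrow{n\to\infty} 0 \quad\text{in } L^1.$$

If only we could replace $\mathcal{Y}_{-\infty,\infty}$ by $\mathcal{Y}_{-\infty,0}$ in this expression, all the assumptions of
Theorem 2.6 would follow immediately. Unfortunately, it is not immediately obvious
whether this replacement is possible without additional assumptions.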

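Remark 4.12. In general, it is not clear whether a conditional $K$-automorphism
relative to $\mathcal{Y}_{-\infty,\infty}$ is necessarily a conditional $K$-automorphism relative to $\mathcal{Y}_{-\infty,0}$. In
this context, it is interesting to note that the corresponding property does hold for
conditional weak mixing. We briefly sketch the proof. Suppose that $(\Omega,\mathcal{B},P,T)$ is
conditionally weakly mixing relative to $\mathcal{Y}_{-\infty,\infty}$. We claim that then also

$$\frac{1}{T}\sum_{k=1}^{T}\big| E[\{f\circ T^{-k}\}\,g\mid\mathcal{Y}_{-\infty,0}] - E[f\circ T^{-k}\mid\mathcal{Y}_{-\infty,0}]\,E[g\mid\mathcal{Y}_{-\infty,0}] \big| \xrightarrow{T\to\infty} 0 \quad\text{in } L^1$$

for every $f,g\in L^2$. Indeed, the conclusion is clearly true whenever $f$ is $\mathcal{Y}_{-\infty,n}$-measurable
for some $n\in\mathbb{Z}$. By approximation in $L^2$, the conclusion holds whenever
$f$ is $\mathcal{Y}_{-\infty,\infty}$-measurable, and it therefore suffices to consider $f\in L^2(\mathcal{Y}_{-\infty,\infty})^\perp$. But in
this case we have $E[f\circ T^{-k}\mid\mathcal{Y}_{-\infty,\infty}] = E[f\circ T^{-k}\mid\mathcal{Y}_{-\infty,0}] = 0$ for all $k$, and

$$\frac{1}{T}\sum_{k=1}^{T}\big| E[\{f\circ T^{-k}\}\,g\mid\mathcal{Y}_{-\infty,0}] \big| \le E\Big[\frac{1}{T}\sum_{k=1}^{T}\big| E[\{f\circ T^{-k}\}\,g\mid\mathcal{Y}_{-\infty,\infty}] \big| \,\Big|\, \mathcal{Y}_{-\infty,0}\Big] \xrightarrow{T\to\infty} 0 \quad\text{in } L^1$$

by Jensen's inequality and the conditional weak mixing property relative to $\mathcal{Y}_{-\infty,\infty}$.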
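
The first step of this sketch, that the claim is clear for $\mathcal{Y}_{-\infty,n}$-measurable $f$, can be unpacked as follows. This is a sketch under an assumption not restated in this section, namely that $f\circ T^{-k}$ is $\mathcal{Y}_{-\infty,n-k}$-measurable whenever $f$ is $\mathcal{Y}_{-\infty,n}$-measurable (as is the case when the observations satisfy $Y_k = Y_0\circ T^k$).

```latex
% For k >= n the factor f o T^{-k} is Y_{-infty,0}-measurable and can be pulled out,
% so the k-th summand vanishes; only the first n-1 summands can be nonzero.
\begin{align*}
E[\{f\circ T^{-k}\}\,g \mid \mathcal{Y}_{-\infty,0}]
  &= \{f\circ T^{-k}\}\,E[g \mid \mathcal{Y}_{-\infty,0}]
   = E[f\circ T^{-k} \mid \mathcal{Y}_{-\infty,0}]\,E[g \mid \mathcal{Y}_{-\infty,0}]
   \qquad (k\ge n),\\
\Big\|\frac{1}{T}\sum_{k=1}^{T}\big|E[\{f\circ T^{-k}\}\,g \mid \mathcal{Y}_{-\infty,0}]
  - E[f\circ T^{-k}\mid\mathcal{Y}_{-\infty,0}]\,E[g\mid\mathcal{Y}_{-\infty,0}]\big|\Big\|_1
  &\le \frac{2\max(n-1,0)}{T}\,\|f\|_2\,\|g\|_2
   \xrightarrow[T\to\infty]{} 0 .
\end{align*}
```

The bound on each of the remaining summands uses the Cauchy-Schwarz inequality together with $\|f\circ T^{-k}\|_2 = \|f\|_2$.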

As we cannot directly replace $\mathcal{Y}_{-\infty,\infty}$ by $\mathcal{Y}_{-\infty,0}$, we take an alternative approach.
We begin by noting that, using the conditional absolute regularity property as
described above, we obtain the following trivial adaptation of Corollary 4.2.

Lemma 4.13. Under the assumptions of Theorem 3.15 we have

$$\frac{1}{T}\sum_{k=1}^{T}\big\{\ell(u_k,X_k) - E[\ell(u_k,X_k)\mid\mathcal{Y}_{-\infty,\infty}]\big\} \xrightarrow{T\to\infty} 0 \quad\text{a.s.}$$

for any admissible strategy $u$.

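We will now proceed to replace $\mathcal{Y}_{-\infty,\infty}$ by $\mathcal{Y}_{-\infty,k}$ in Lemma 4.13. To this end, we
use the additional property established in [27, Proposition 3.9]:

$$P[(X_k)_{k\le 0}\in\cdot\mid\mathcal{Y}_{-\infty,0}] \sim P[(X_k)_{k\le 0}\in\cdot\mid\mathcal{Y}_{-\infty,\infty}] \quad\text{a.s.}$$

Theorem 3.14 implies that the past tail σ-field $\bigcap_n\mathcal{X}_{-\infty,n}$ is $P[\,\cdot\mid\mathcal{Y}_{-\infty,0}]$-trivial a.s.
(cf. [33]). Thus a standard argument [21, Theorem III.14.10] yields

$$\big\| P[(X_k)_{k\le -n}\in\cdot\mid\mathcal{Y}_{-\infty,0}] - P[(X_k)_{k\le -n}\in\cdot\mid\mathcal{Y}_{-\infty,\infty}] \big\|_{\mathrm{TV}} \xrightarrow{n\to\infty} 0 \quad\text{in } L^1.$$

Therefore, by stationarity and a simple truncation argument, we have

$$\operatorname*{ess\,sup}_{u\in\mathbb{U}_0}\big| E[\ell(u,X_0)\mid\mathcal{Y}_{-\infty,n}] - E[\ell(u,X_0)\mid\mathcal{Y}_{-\infty,\infty}] \big| \xrightarrow{n\to\infty} 0 \quad\text{in } L^1.$$

This yields the following consequence.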

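Corollary 4.14. Under the assumptions of Theorem 3.15 we have

$$\frac{1}{T}\sum_{k=1}^{T}\big\{\ell(u_k,X_k) - E[\ell(u_k,X_k)\mid\mathcal{Y}_{-\infty,k}]\big\} \xrightarrow{T\to\infty} 0 \quad\text{a.s.}$$

for any admissible strategy $u$. In particular,

$$\liminf_{T\to\infty} L_T(u) \ge E\Big[\operatorname*{ess\,inf}_{u\in\mathbb{U}_0} E[\ell_0(u)\mid\mathcal{Y}_{-\infty,0}]\Big] = L^\star \quad\text{a.s.}$$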



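Proof (Sketch). Following almost verbatim the proof of Lemma 4.7 one can prove

$$\frac{1}{T}\sum_{k=1}^{T}\big\{E[\ell(u_k,X_k)\mid\mathcal{Y}_{-\infty,k}] - E[\ell(u_k,X_k)\mid\mathcal{Y}_{-\infty,k+r}]\big\} \xrightarrow{T\to\infty} 0 \quad\text{a.s.}$$

for any $r\in\mathbb{N}$. On the other hand, we have

$$\begin{aligned}
&\limsup_{T\to\infty}\Big|\frac{1}{T}\sum_{k=1}^{T}\big\{E[\ell(u_k,X_k)\mid\mathcal{Y}_{-\infty,k+r}] - E[\ell(u_k,X_k)\mid\mathcal{Y}_{-\infty,\infty}]\big\}\Big| \\
&\qquad\le \limsup_{T\to\infty}\frac{1}{T}\sum_{k=1}^{T}\operatorname*{ess\,sup}_{u\in\mathbb{U}_k}\big|E[\ell(u,X_k)\mid\mathcal{Y}_{-\infty,k+r}] - E[\ell(u,X_k)\mid\mathcal{Y}_{-\infty,\infty}]\big| \\
&\qquad= E\Big[\operatorname*{ess\,sup}_{u\in\mathbb{U}_0}\big|E[\ell(u,X_0)\mid\mathcal{Y}_{-\infty,r}] - E[\ell(u,X_0)\mid\mathcal{Y}_{-\infty,\infty}]\big|\Big] \quad\text{a.s.}
\end{aligned}$$

by the ergodic theorem. It was shown above that the latter quantity converges to zero
as $r\to\infty$, and the result now follows using Lemma 4.13. □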
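
The way the three ingredients of this sketch combine can be made explicit through a telescoping decomposition. The following is offered as a reading aid rather than as the author's own display; it uses only the conditioning σ-fields $\mathcal{Y}_{-\infty,k}$, $\mathcal{Y}_{-\infty,k+r}$ and $\mathcal{Y}_{-\infty,\infty}$ appearing in the sketch.

```latex
\begin{align*}
\ell(u_k,X_k) - E[\ell(u_k,X_k)\mid\mathcal{Y}_{-\infty,k}]
 &= \big\{\ell(u_k,X_k) - E[\ell(u_k,X_k)\mid\mathcal{Y}_{-\infty,\infty}]\big\}
    && \text{(Lemma 4.13)}\\
 &\quad + \big\{E[\ell(u_k,X_k)\mid\mathcal{Y}_{-\infty,\infty}]
              - E[\ell(u_k,X_k)\mid\mathcal{Y}_{-\infty,k+r}]\big\}
    && \text{(ergodic theorem bound)}\\
 &\quad + \big\{E[\ell(u_k,X_k)\mid\mathcal{Y}_{-\infty,k+r}]
              - E[\ell(u_k,X_k)\mid\mathcal{Y}_{-\infty,k}]\big\}
    && \text{(as in Lemma 4.7)}.
\end{align*}
```

Averaging over $k = 1,\ldots,T$, the first and third brackets vanish almost surely as $T\to\infty$ for each fixed $r$, while the limsup of the second is bounded by a quantity that tends to zero as $r\to\infty$; since the left-hand side does not depend on $r$, its time average converges to zero almost surely.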

The remainder of the proof of Theorem 3.15 is identical to that of Theorem 2.6
modulo trivial modifications, and is therefore omitted.




Acknowledgment 






This work was partially supported by NSF grant DMS-1005575. 



References 



1. Algoet, P.H.: The strong law of large numbers for sequential decisions under uncertainty. IEEE
Trans. Inform. Theory 40(3), 609-633 (1994) 

2. Bellow, A., Losert, V.: The weighted pointwise ergodic theorem and the individual ergodic 
theorem along subsequences. Trans. Amer. Math. Soc. 288(1), 307-345 (1985) 

3. Berend, D., Bergelson, V.: Mixing sequences in Hilbert spaces. Proc. Amer. Math. Soc. 98(2), 
239-246 (1986) 

4. Cappé, O., Moulines, E., Rydén, T.: Inference in hidden Markov models. Springer, New York
(2005)

5. Chigansky, P., van Handel, R.: A complete solution to Blackwell's unique ergodicity problem 
for hidden Markov chains. Ann. Appl. Probab. 20(6), 2318-2345 (2010) 

6. Conze, J. P.: Convergence des moyennes ergodiques pour des sous-suites. In: Contributions au 
calcul des probabilités, pp. 7-15. Bull. Soc. Math. France, Mém. No. 35. Soc. Math. France,
Paris (1973) 

7. Crisan, D., Rozovskii, B. (eds.): The Oxford handbook of nonlinear filtering. Oxford
University Press, Oxford (2011)

8. Del Moral, P., Ledoux, M.: Convergence of empirical processes for interacting particle
systems with applications to nonlinear filtering. J. Theoret. Probab. 13(1), 225-257 (2000)

9. Dellacherie, C., Meyer, P.A.: Probabilities and potential. C. North-Holland, Amsterdam
(1988) 

10. Dudley, R.M.: Uniform central limit theorems. Cambridge University Press, Cambridge 
(1999) 

11. Etemadi, N.: An elementary proof of the strong law of large numbers. Z. Wahrsch. Verw. 
Gebiete 55(1), 119-122 (1981) 

12. Grothendieck, A.: Produits tensoriels topologiques et espaces nucléaires. Mem. Amer. Math.
Soc. 1955(16), 140 (1955)

13. Halmos, P.R.: In general a measure preserving transformation is mixing. Ann. of Math. (2)
45, 786-792 (1944) 

14. van Handel, R.: The stability of conditional Markov processes and Markov chains in random 
environments. Ann. Probab. 37(5), 1876-1925 (2009) 

15. van Handel, R.: Uniform time average consistency of Monte Carlo particle filters. Stochastic 
Process. Appl. 119(11), 3835-3861 (2009) 

16. van Handel, R.: On the exchange of intersection and supremum of σ-fields in filtering theory.
Israel J. Math. (2012). To appear 

17. van Handel, R.: The universal Glivenko-Cantelli property. Probab. Th. Rel. Fields (2012). To 
appear 

18. Hoffmann-Jørgensen, J.: Uniform convergence of martingales. In: Probability in Banach
spaces, 7 (Oberwolfach, 1988), Progr. Probab., vol. 21, pp. 127-137. Birkhäuser Boston,
Boston, MA (1990)

19. Kallenberg, O.: Foundations of modern probability, second edn. Springer-Verlag, New York
(2002) 

20. Kunita, H.: Asymptotic behavior of the nonlinear filtering errors of Markov processes. J. 
Multivariate Anal. 1, 365-393 (1971) 

21. Lindvall, T.: Lectures on the coupling method. Dover Publications Inc., Mineola, NY (2002).
Corrected reprint of the 1992 original






22. Meyn, S., Tweedie, R.L.: Markov chains and stochastic stability, second edn. Cambridge 
University Press, Cambridge (2009) 

23. Neveu, J.: Discrete-parameter martingales. North-Holland, Amsterdam (1975) 

24. Nobel, A.B.: On optimal sequential prediction for general processes. IEEE Trans. Inform. 
Theory 49(1), 83-98 (2003) 

25. Pollard, D.: A user's guide to measure theoretic probability. Cambridge University Press, 
Cambridge (2002) 

26. Rudolph, D.J.: Pointwise and L 1 mixing relative to a sub-sigma algebra. Illinois J. Math. 
48(2), 505-517 (2004) 

27. Tong, X.T., van Handel, R.: Conditional ergodicity in infinite dimension (2012). Preprint 

28. Totoki, H.: On a class of special flows. Z. Wahrscheinlichkeitstheorie und Verw. Gebiete 15,
157-167 (1970) 

29. van der Vaart, A.W., Wellner, J.A.: Weak convergence and empirical processes.
Springer-Verlag, New York (1996)

30. Volkonskii, V.A., Rozanov, Y.A.: Some limit theorems for random functions. I. Theor.
Probability Appl. 4, 178-197 (1959)

31. Walters, P.: An introduction to ergodic theory. Springer-Verlag, New York (1982)

32. Weissman, T., Merhav, N.: Universal prediction of random binary sequences in a noisy
environment. Ann. Appl. Probab. 14(1), 54-89 (2004)

33. von Weizsäcker, H.: Exchanging the order of taking suprema and countable intersections of
σ-algebras. Ann. Inst. H. Poincaré Sect. B (N.S.) 19(1), 91-100 (1983)



